Network Controller for Managing Software and Hardware Forwarding Elements

ABSTRACT

Some embodiments provide a set of one or more network controllers that communicates with a wide range of devices, ranging from switches to appliances such as firewalls, load balancers, etc. The set of network controllers communicates with such devices to connect them to its managed virtual networks. The set of network controllers can define each virtual network through software switches and/or software appliances. To extend the control beyond software network elements, some embodiments implement a database server on each dedicated hardware. The set of network controllers accesses the database server to send management data. The hardware then translates the management data to connect to a managed virtual network.

CLAIM OF BENEFIT TO PRIOR APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application 61/887,117, entitled “Managing Software and Hardware Forwarding Elements to Define Virtual Networks”, filed Oct. 4, 2013. U.S. Provisional Patent Application 61/887,117 is incorporated herein by reference.

BACKGROUND

Many current enterprises have large and sophisticated networks comprising switches, hubs, routers, servers, workstations and other networked devices, which support a variety of connections, applications and systems. The increased sophistication of computer networking, including virtual machine migration, dynamic workloads, multi-tenancy, and customer-specific quality of service and security configurations, requires a better paradigm for network control. Networks have traditionally been managed through low-level configuration of individual components. Network configurations often depend on the underlying network: for example, blocking a user's access with an access control list (“ACL”) entry requires knowing the user's current IP address. More complicated tasks require more extensive network knowledge: forcing guest users' port 80 traffic to traverse an HTTP proxy requires knowing the current network topology and the location of each guest. This process is of increased difficulty where the network switching elements are shared across multiple users.

In response, there is a growing movement, driven by both industry and academia, towards a new network control paradigm called Software-Defined Networking (SDN). In the SDN paradigm, a network controller, running on one or more servers in a network, controls, maintains, and implements control logic that governs the forwarding behavior of shared network switching elements on a per user basis. Making network management decisions often requires knowledge of the network state. To facilitate management decision-making, the network controller creates and maintains a view of the network state and provides an application programming interface upon which management applications may access a view of the network state.

Three of the many challenges of large networks (including datacenters and the enterprise) are scalability, mobility, and multi-tenancy, and often the approaches taken to address one hamper the others. Currently, instead of a hardware switch, an extender or an L2 gateway is used to bridge the physical network to the virtual network. For instance, the L2 gateway may be used to connect two managed environments, or connect a managed environment and an unmanaged environment. The extender is an x86 box. As such, its throughput is less than that of dedicated hardware. The extender also introduces one extra hop that can hinder performance and latency.

BRIEF SUMMARY

Some embodiments provide a set of one or more network controllers that communicates with a wide range of devices (e.g., third-party hardware), ranging from switches to appliances such as firewalls, load balancers, etc. The set of network controllers communicates with such devices to connect them to its managed virtual networks. The set of network controllers can define each virtual network through software switches and/or software appliances. To extend the control beyond software network elements, some embodiments implement a database server on dedicated hardware and control the hardware through the database server using a database protocol. For instance, the set of network controllers accesses the database server to send management data. The hardware then translates the management data to connect to a managed virtual network.

In some embodiments, the database server is designed to handle transactions and deal with conflicts in having multiple writers (e.g., more than one network controller writing to a database). As an example, for a given transaction, the database server of some embodiments executes a set of operations on a database. When there are multiple operations, the database server may execute each operation in a specified order, except that if an operation fails, then the remaining operations are not executed. The set of operations is executed as a single atomic, consistent, isolated transaction. The transaction is committed only if each and every operation succeeds. In this manner, the database server provides a means of transacting with a client that maintains the reliability of the database even when an operation in the set of operations fails to complete.
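The transaction behavior described above can be illustrated with a minimal Python sketch. It is not the database server's implementation; the table layout, the `run_transaction` function, and the `insert` and `fail` helpers are hypothetical names introduced only for illustration. The sketch applies operations in order to a working copy, stops at the first failure, and commits only when every operation succeeds. Isolation between concurrent clients is outside its scope.

```python
import copy


class TransactionError(Exception):
    """Raised when an operation in a transaction fails."""


def run_transaction(database, operations):
    """Apply operations in order; commit only if every one succeeds.

    `database` is a dict of table name -> list of row dicts (assumed layout).
    Each operation is a callable that mutates the working copy or raises.
    """
    working = copy.deepcopy(database)
    for index, operation in enumerate(operations):
        try:
            operation(working)
        except Exception as exc:
            # Remaining operations are skipped; the original is untouched.
            raise TransactionError(f"operation {index} failed: {exc}") from exc
    # Every operation succeeded, so the working copy becomes the database.
    database.clear()
    database.update(working)
    return database


def insert(table, row):
    """Hypothetical operation: append a row to a table."""
    def op(db):
        db.setdefault(table, []).append(dict(row))
    return op


def fail(reason):
    """Hypothetical operation that always fails, for demonstration."""
    def op(db):
        raise ValueError(reason)
    return op
```

For example, a transaction containing `insert`, `fail`, `insert` leaves the database exactly as it was before the call, including the effect of the first insert.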

To prevent multiple clients from writing to the database at the same time, the database server may support various lock operations to lock or unlock the database. In some embodiments, each client must obtain a particular lock from the database server before the client can write to a certain table of the database. The database will assign the client ownership of the lock as soon as it becomes available. When multiple clients request the same lock, they will receive it in first-come, first-served order. After receiving the lock, the client can then write to the database and release the lock by performing an unlock operation.
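As a rough illustration of the first-come, first-served lock ordering described above, the following Python sketch keeps one owner and one waiting queue per named lock. The `LockManager` class and its method names are assumptions made for this example and are not taken from the database protocol.

```python
from collections import deque


class LockManager:
    """Hypothetical per-name lock table with first-come, first-served hand-off."""

    def __init__(self):
        self._owner = {}    # lock name -> client currently holding it
        self._waiting = {}  # lock name -> deque of clients waiting in order

    def lock(self, name, client):
        """Return True if the client owns the lock now, False if it is queued."""
        if name not in self._owner:
            self._owner[name] = client
            return True
        self._waiting.setdefault(name, deque()).append(client)
        return False

    def unlock(self, name, client):
        """Release the lock and return the next owner, or None if nobody waits."""
        if self._owner.get(name) != client:
            raise PermissionError(f"{client} does not own {name}")
        queue = self._waiting.get(name)
        if queue:
            next_client = queue.popleft()
            self._owner[name] = next_client
            return next_client
        del self._owner[name]
        return None

    def owner(self, name):
        """Return the current owner of the lock, or None if it is free."""
        return self._owner.get(name)
```

A client would write to the protected table only while `owner(name)` returns its own identity, then call `unlock` so the next waiting client receives the lock.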

In some embodiments, the database server supports bi-directionalasynchronous notifications. For example, when there is an update to adatabase table, the database server sends a notification regarding anupdate to a client (e.g., executing on a network controller or on thehardware). The notification may include a copy of the table or a subsetof the table (e.g., a record) that was updated. In some embodiments, theprotocol's update call is used to exchange forwarding state. Forinstance, if the hardware is a switch, it can publish its forwardingstate by having its database client write (e.g., a learned MAC address)to a database table. This will in turn cause the database server to pushthe update to a database client executing on a network controller. Thenetwork controller can then notify other network elements (e.g., thesoftware switches) regarding the update. When a network controller isnotified of an update to a software switch, the network controller'sclient accesses the database server executing on the hardware switch towrite to a database table. This will in turn cause the database serverto push the update to the database client executing on the hardwareswitch. The hardware switch's software stack can then translate theupdate to a flow that it understands to process packets.

The database server of some embodiments maintains a database by performing garbage collection operations to remove database entries (e.g., records) that are not used. In some embodiments, the garbage collection is performed at runtime (e.g., with each given transaction), performed periodically (e.g., at a set interval as a background task), or performed when triggered (e.g., when there is a change to the database or a change to a particular table of the database). In some embodiments, if a table entry (e.g., a record) is not part of a root set and the entry has no reference, or no reference of a particular importance (e.g., a strong reference), then the table entry is subject to garbage collection. The garbage collection process prevents the database from continually growing in size over time with unused data.
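The garbage collection rule above can be sketched in Python as follows. This is only one possible reading: a row is removed when it is outside the root set and no remaining row holds a strong reference to it, and the check is repeated because removing one row can leave others unreferenced. The row layout and the `collect_garbage` name are assumptions made for this example.

```python
def collect_garbage(rows, root_ids):
    """Remove rows that are not roots and have no strong inbound reference.

    `rows` maps a row id to {"strong": set of row ids, "weak": set of row ids}
    (an assumed layout). Weak references do not keep a row alive.
    Returns the set of removed row ids; `rows` is modified in place.
    """
    removed = set()
    while True:
        strongly_referenced = set()
        for row in rows.values():
            strongly_referenced |= row["strong"]
        victims = [
            row_id
            for row_id in rows
            if row_id not in root_ids and row_id not in strongly_referenced
        ]
        if not victims:
            return removed
        for row_id in victims:
            del rows[row_id]
            removed.add(row_id)
```

In this sketch a row referenced only weakly is collected, and a row whose only strong referrer is itself collected disappears on a later pass.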

Some embodiments provide a network controller that manages software and hardware switching elements. The network controller sends management data to the hardware switching element using a protocol to add the hardware switching element to a virtual network. To manage traffic between the hardware and software switching elements, the network controller exchanges forwarding states with the hardware switching element through the protocol's asynchronous notification. The forwarding state of the software switching element is asynchronously sent from the network controller to the hardware switching element when the software's forwarding state has changed. The forwarding state of the hardware switching element is asynchronously received at the network controller from the hardware switching element when the hardware's forwarding state has changed.

The network controller of some embodiments facilitates implementing a logical switching element from software and hardware switching elements. In doing so, the network controller sends a first transaction that instructs a database server on the hardware switching element to write to a database a logical forwarding element identifier (LFEI) that identifies a logical switch. The network controller sends a second transaction that instructs the database server on the hardware switching element to write to the database an address of at least one software switching element that uses the LFEI. The hardware switching element uses the address to establish a tunnel between the hardware switching element and the software switching element. The hardware and software switching elements implement the logical switching element by sending packets over the established tunnel using the LFEI.
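The two transactions described above can be pictured with a small Python sketch. The `HardwareSwitchDatabase` class, its method names, the integer LFEI, and the example addresses are all hypothetical; the sketch only shows how an identifier written first and peer addresses written second are enough for the hardware side to derive its tunnels.

```python
class HardwareSwitchDatabase:
    """Hypothetical stand-in for the database on a hardware switching element."""

    def __init__(self, local_address):
        self.local_address = local_address
        self.lfeis = set()  # logical forwarding element identifiers
        self.peers = {}     # LFEI -> set of software switch addresses

    def write_lfei(self, lfei):
        """First transaction: record the identifier of a logical switch."""
        self.lfeis.add(lfei)
        self.peers.setdefault(lfei, set())

    def write_peer_address(self, lfei, address):
        """Second transaction: record a software switch that uses the LFEI."""
        if lfei not in self.lfeis:
            raise KeyError(f"unknown LFEI {lfei!r}")
        self.peers[lfei].add(address)

    def tunnels(self):
        """Tunnels the hardware switch would establish, as (local, remote, LFEI)."""
        return sorted(
            (self.local_address, remote, lfei)
            for lfei, remotes in self.peers.items()
            for remote in remotes
        )
```

A controller-side caller would invoke `write_lfei` once per logical switch and `write_peer_address` once per software switching element, after which `tunnels()` lists the endpoints the hardware needs to reach.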

Embodiments described herein provide a system for controlling forwarding elements. The system includes a network controller (e.g., that operates on a computing device) to generate and send forwarding instructions to several forwarding elements in a virtual network, including software and hardware forwarding elements. The system includes a service node (e.g., that operates on another computing device or the same computing device) to (1) receive, based on the forwarding instructions, each unknown unicast packet from a software or a hardware forwarding element in the plurality of forwarding elements, (2) replicate the packet, and (3) send the packet to each other hardware forwarding element. The unknown unicast packet is sent to each particular hardware forwarding element in the virtual network so that the particular hardware forwarding element identifies whether a machine connected to the hardware forwarding element's port has the same address as the destination address associated with the packet, and outputs the packet to the port if the addresses are the same. For example, to find a matching address, the hardware forwarding element may flood some or all of its ports and record the MAC address of the packet that responds to the flood.
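The replicate-and-send behavior of the service node can be sketched in Python as below. The function names, the dictionary packet representation, and the port-to-MAC mapping are assumptions for illustration only; a real hardware forwarding element may instead flood its ports to learn the address.

```python
def replicate_unknown_unicast(packet, source, hardware_elements):
    """Service node side: copy the packet to every other hardware element.

    Returns a list of (destination element, packet copy) pairs. The element
    that sent the packet, if it is a hardware element, is skipped.
    """
    return [(dest, dict(packet)) for dest in hardware_elements if dest != source]


def ports_to_output(packet, port_macs):
    """Hardware element side: ports whose attached machine matches dst_mac.

    `port_macs` maps a port number to the MAC address of the attached machine.
    """
    return [port for port, mac in port_macs.items() if mac == packet["dst_mac"]]
```

Each receiving element applies `ports_to_output` to its own port table; an element with no matching machine simply outputs nothing.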

In some embodiments, the service node receives, based on the forwarding instructions from the network controller, each multicast packet from a software or a hardware forwarding element. The service node then replicates the packet and sends the packet to each other hardware forwarding element. The multicast packet is sent to each particular hardware forwarding element in the virtual network so that the particular hardware forwarding element identifies whether a machine connected to the hardware forwarding element's port is a part of the virtual network, and outputs the packet to the port if the machine is a part of the virtual network.

The preceding Summary is intended to serve as a brief introduction to some embodiments as described herein. It is not meant to be an introduction or overview of all subject matter disclosed in this document. The Detailed Description that follows and the Drawings that are referred to in the Detailed Description will further describe the embodiments described in the Summary as well as other embodiments. Accordingly, to understand all the embodiments described by this document, a full review of the Summary, Detailed Description and the Drawings is needed. Moreover, the claimed subject matters are not to be limited by the illustrative details in the Summary, Detailed Description and the Drawings, but rather are to be defined by the appended claims, because the claimed subject matters can be embodied in other specific forms without departing from the spirit of the subject matters.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the invention are set forth in the appended claims. However, for purposes of explanation, several embodiments of the invention are set forth in the following figures.

FIG. 1 illustrates an example of how a network controller cluster manages software and hardware switches to create virtual networks.

FIG. 2 provides an illustrative example of a network controller cluster that communicates with a hardware switch.

FIG. 3 is a system diagram that illustrates the network controller cluster 130 communicating with both hardware and software switches.

FIG. 4 conceptually illustrates an architectural diagram of an example hardware switch.

FIG. 5 provides an illustrative example of how a database server handles conflicts in having multiple network controllers attempting to update a database at the same time.

FIG. 6 illustrates an example of a client on dedicated hardware being notified of an update to a database table.

FIG. 7 illustrates an example of a client on a network controller being notified of an update to a database table.

FIG. 8 conceptually illustrates a process that some embodiments perform on a given transaction.

FIG. 9 conceptually illustrates an example of performing a garbage collection operation on a database.

FIG. 10 illustrates an example of a controller cluster that communicates with different edge devices to create tunnels.

FIG. 11 shows an example physical topology with several tunnel endpoints.

FIG. 12 shows an example logical topology with two logical switches.

FIG. 13 conceptually illustrates a process that some embodiments perform to access such a database to implement virtual networks.

FIG. 14 provides a data flow diagram that shows a top of rack (TOR) switch that publishes a MAC address of a machine that is connected to one of its ports.

FIG. 15 illustrates a data flow diagram that shows a MAC address of a virtual machine being pushed to the TOR switch.

FIG. 16 illustrates a packet flow diagram for a known unicast.

FIG. 17 illustrates a packet flow diagram for a multicast.

FIG. 18 shows a packet flow diagram for an Address Resolution Protocol (ARP) request.

FIG. 19 illustrates a packet flow diagram for an unknown unicast.

FIG. 20 conceptually illustrates an example of a network control system of some embodiments that manages hardware and software forwarding elements.

FIG. 21 conceptually illustrates an example of a network control system of some embodiments that manages hardware and software forwarding elements.

FIG. 22 conceptually illustrates an electronic system with which some embodiments of the invention are implemented.

DETAILED DESCRIPTION

Embodiments described herein provide a set of one or more network controllers that communicates with a wide range of devices (e.g., third-party hardware), ranging from switches to appliances such as firewalls, load balancers, etc. The set of network controllers communicates with such devices to connect them to its managed virtual networks. The set of network controllers can define each virtual network through software switches and/or software appliances. To extend the control beyond software network elements, some embodiments implement a database server on dedicated hardware and control the hardware through the database server using a database protocol. For instance, the set of network controllers accesses the database server to send management data. The hardware then translates the management data to connect to a managed virtual network.

In some embodiments, the database server is designed to handle transactions and deal with conflicts in having multiple writers (e.g., more than one network controller writing to a database). As an example, for a given transaction, the database server of some embodiments executes a set of operations on a database. When there are multiple operations, the database server may execute each operation in a specified order, except that if an operation fails, then the remaining operations are not executed. To prevent multiple clients from writing to the database at the same time, the database server may support various lock operations to lock or unlock the database.

The database server of some embodiments supports other types of calls (e.g., remote procedure calls) using the database protocol, including an update notification. When there is an update to a database table, the database server sends a notification regarding an update to a client (e.g., executing on a network controller or on the hardware). The notification may include a copy of the table or a subset of the table (e.g., a record) that was updated. In some embodiments, the protocol's update call is used to exchange management data and forwarding state.

For some embodiments of the invention, FIG. 1 illustrates an example of how a network controller cluster 130 communicates with several forwarding elements 115-125 to create virtual networks. Specifically, this figure shows the controller cluster exchanging management and configuration data with the forwarding elements to create tunnels. The tunnels extend the virtual network overlay among software or virtual switches 115 and 120 and the physical switch 125. Two operational stages 105 and 110 are shown in this figure.

In some embodiments, the forwarding elements 115-125 can be configured to route network data (e.g., packets) between network elements 130-150 (e.g., virtual machines, computers) that are coupled to the forwarding elements. For instance, the software switch 115 can route data to the software switch 120, and can route data to the physical switch 125. In addition, the software switch 120 can route data to the software switch 115 and the physical switch 125, while the physical switch can route data to the two software switches.

The software switch (115 or 120) operates on a host (e.g., a hypervisor). As an example, the software switch 115 may operate on a dedicated computer, or on a computer that performs other non-switching operations. The physical switch 125 has dedicated hardware to forward packets. That is, different from the software switches 115 and 120, the physical switch 125 has application-specific integrated circuits (ASICs) that are specifically designed to support in-hardware forwarding. The term “packet” is used here as well as throughout this application to refer to a collection of bits in a particular format sent across a network. One of ordinary skill in the art will recognize that the term “packet” may be used herein to refer to various formatted collections of bits that may be sent across a network, such as Ethernet frames, TCP segments, UDP datagrams, IP packets, etc.

The network controller cluster 130 manages and configures the forwarding elements 115-125 to create virtual networks. In some embodiments, the network controller cluster 130 communicates with the software switches 115 and 120 using different protocols. Specifically, the protocols include an OpenFlow protocol and a database protocol. The OpenFlow protocol is primarily used to inspect and modify a set of one or more flow tables (analogous to a set of flow tables in the physical switch 125). The network controller cluster computes flows and pushes them to the software switches 115 and 120 through the OpenFlow channel. The network controller cluster 130 communicates with the software switches 115 and 120 using the database protocol to create and manage overlay tunnels to transport nodes. The network controller cluster 130 might also use this protocol for discovery purposes (e.g., discover which virtual machines are hosted at a hypervisor).

For the software switches 115 and 120, the network controller cluster 130 uses these two protocols to implement packet forwarding for logical datapaths. Each logical datapath comprises a series (pipeline) of logical flow tables, each with its own globally unique identifier. The tables include a set of flow entries that specify expressions to match against the header of a packet, and actions to take on the packet when a given expression is satisfied. Possible actions include modifying a packet, dropping a packet, sending it to a given egress port on the logical datapath, and writing in-memory metadata (analogous to registers on the physical switch 125) associated with the packet and resubmitting it back to the datapath for further processing. A flow expression can match against this metadata, in addition to the packet's header. The network controller cluster 130 writes the flow entries for each logical datapath to a single software switch flow table at each software switch that participates in the logical datapath.
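To make the match-and-action pipeline above concrete, here is a minimal Python sketch. It is not the OpenFlow wire format or the software switch's implementation; the entry layout, the action tuples, the `max_resubmits` bound, and the choice to drop a packet that matches no entry are all assumptions made for illustration.

```python
def matches(match, packet, metadata):
    """True if every header and metadata field in the expression is satisfied."""
    header_ok = all(packet.get(k) == v for k, v in match.get("header", {}).items())
    meta_ok = all(metadata.get(k) == v for k, v in match.get("metadata", {}).items())
    return header_ok and meta_ok


def process(packet, tables, max_resubmits=8):
    """Run a packet through a pipeline of flow tables, starting at table 0.

    `tables` maps a table id to a list of {"match": ..., "action": ...} entries.
    Actions (assumed encoding): ("output", port), ("drop",), or
    ("resubmit", metadata_updates, next_table_id).
    """
    metadata = {}
    table_id = 0
    for _ in range(max_resubmits):
        entry = next(
            (e for e in tables[table_id] if matches(e["match"], packet, metadata)),
            None,
        )
        if entry is None or entry["action"][0] == "drop":
            return ("drop", None)
        action = entry["action"]
        if action[0] == "output":
            return ("output", action[1])
        # Resubmit: write metadata, then continue in the next table.
        metadata.update(action[1])
        table_id = action[2]
    return ("drop", None)
```

A first table might map an ingress port to a logical port stored in metadata, and a second table might match that metadata together with the destination MAC address to pick a logical egress port.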

Different from the software switches 115 and 120, the network controller cluster 130 communicates with the physical switch 125 primarily using one protocol, namely the database protocol. Through the database channel, the network controller cluster 130 reads the configurations of the physical switch (e.g., an inventory of its physical ports) and sends management data to the physical switch. For example, a network controller might send instructions to the physical switch 125 to create tunnel ports for a logical switch.

In some embodiments, the network controller cluster 130 communicates with the physical switch over the database channel to exchange forwarding state (e.g., L2 and/or L3 forwarding state). For instance, the physical switch 125 might send an update notification to a network controller regarding a learned MAC address of a machine (e.g., desktop computer, laptop) that is connected to its port. The network controller can then compute a flow and push the flow down to the software switches using the OpenFlow channel. The network controller might also send to the physical switch 125 the MAC addresses of the machines 130-140 that are coupled to the software switches 115 and 120.

The forwarding state is exchanged with the physical switch 125 because the flows of the physical switch are ASIC based and not OpenFlow based as in the software switches 115 and 120. In other words, the network controller cluster 130 does not compute flows for the physical switch 125 and push them to the physical switch through the database channel. Instead, the physical switch 125 computes its own flows based on the forwarding information, and how the physical switch computes the flows can vary from one switch vendor to another.

Having described several components of FIG. 1, an example operation of these components will now be described by reference to the two stages 105 and 110 that are shown in the figure. The first stage 105 illustrates an example control plane implementation to create tunnels between the forwarding elements 115-125. In particular, the network controller cluster 130 exchanges management data with the software switches 115 and 120 using the database protocol. The network controller cluster 130 also computes and pushes flows to the software switches 115 and 120 through the OpenFlow channel. The network controller cluster 130 communicates with the physical switch to send management data using the database protocol. The dedicated hardware then translates the management data to define a tunnel between the hardware and each of the software switches (115 and 120). Furthermore, the network controller cluster 130 and the physical switch 125 use the database protocol to exchange forwarding state.

The second stage 110 shows an example data plane view of the virtual network. In this example, several tunnels are established, one between each pair of forwarding elements. Namely, a tunnel is established between the two software switches 115 and 120. There is also a tunnel between the physical switch 125 and the software switch 115, and between the physical switch 125 and the software switch 120. In some embodiments, these tunnels provide a virtual unicast or multicast link (e.g., Ethernet link) between the forwarding elements 115-125, which forward unencapsulated frames to and from attached components such as the machines 130-150 (e.g., VMs or physical links).

In the example described above, the network controller cluster 130 defines the physical switch 125 to be a tunnel endpoint. Normally, an extender or a gateway is used to connect two managed environments, or a managed environment and an unmanaged environment. The extender is an x86 box (e.g., that runs the software switch). As the physical switch 125 is managed by the network controller cluster 130, the system does not introduce such an extender in between the physical switch 125 and the machines 145 and 150. Thus, there is no extra hop that can hinder performance and latency, and no intermediate device whose hardware specifications may provide less throughput than the physical switch's dedicated hardware.

Many more examples of managing hardware and software forwarding elements are described below. Specifically, Section I shows an example database protocol that is used to extend the control to hardware switches and appliances. This is followed by Section II, which describes several example features of a database server that is installed on hardware. Next, Section III describes examples of creating virtual networks with hardware and software switches. Section IV then describes examples of exchanging forwarding state with a physical switch. Section V then describes several example packet flows, including known unicast, unknown unicast, and multicast. Section VI then describes the environment in which some embodiments of the invention are implemented. Finally, Section VII describes an example of an electronic system that implements some embodiments described herein.

I. Extending the Management beyond Software Switches and Appliances

In some embodiments, the network controller cluster defines each virtual network through software switches and/or software appliances. To extend the control beyond software network elements, some embodiments implement a database server on dedicated hardware. The set of network controllers accesses the database server to send management data. The dedicated hardware then translates the management data to connect to a managed virtual network.

FIG. 2 provides an illustrative example of a network controller cluster that communicates with a hardware switch. The network controller cluster exchanges data with such hardware to connect the hardware to the virtual networks. This figure shows the network controller cluster 130 and the physical switch 125. The physical switch 125 includes a database server 210, a software stack 220, and a switch ASIC 225. The software stack has a database client 215. Similarly, each network controller has a database client 205.

In some embodiments, the database server 210 is designed to handle transactions and deal with conflicts in having multiple writers (e.g., more than one network controller writing to a database). The database server 210 is also designed to provide asynchronous notifications. For example, when there is an update to a database table, the database server 210 sends a notification regarding an update to a client (e.g., executing on a network controller or on the hardware). The notification may include a copy of the table or a subset of the table (e.g., a record) that was updated. In the example of FIG. 2, the database server 210 is an Open Virtual Switch (OVS) database server that accesses the OVS database 220.

The switch software stack 220 represents several programs that operate on the physical switch 125. The software stack can include a variety of different programs to configure and manage the switch. This can include management that is in and outside of the scope of the network controller cluster 130. For instance, the software stack may include a program to update its firmware, modify switch settings (e.g., its administrative password), and/or reset the switch. The software stack 220 is vendor-specific, which means that it can change from one vendor to another vendor. In addition, different vendors might provide different features that are represented by their corresponding software stack 220.

In FIG. 2, the software stack 220 includes at least one module to program the switch ASIC. The switch ASIC is a component that is specifically designed to support in-hardware forwarding. That is, it is primarily designed to quickly forward packets. To simplify the description, only one switching ASIC is shown. However, one of ordinary skill in the art would understand that the physical switch could include a number of ASICs that operate in conjunction with one another to forward packets.

As shown, the database server 210 communicates with multiple clients 205 and 215. Each client accesses the database 220 through the database server 210. Each client reads and writes to the database 220 using the database protocol. Each client may also be notified of an update to the database 220 (e.g., a table, a subset of a table). In some embodiments, the database protocol specifies a monitor call, which is a request, sent from a database client (205 or 215) to the database server 210, to monitor one or more columns of a table and receive updates when there is an update to the one or more columns (e.g., a new row value, an update to an existing row value, etc.). For instance, the network controller's client 205 may be notified when the switch's client 215 updates the database 220, and vice versa.
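The monitor call described above can be approximated with an in-memory Python sketch. It does not reproduce the database protocol's message format; the `MonitoredTable` class, the callback signature, and the choice not to notify the client that made the write are assumptions made for this example.

```python
class MonitoredTable:
    """Hypothetical table that notifies clients monitoring specific columns."""

    def __init__(self):
        self.rows = {}      # row id -> dict of column -> value
        self.monitors = []  # (client name, monitored columns, callback)

    def monitor(self, client, columns, callback):
        """Register interest in updates to the given columns."""
        self.monitors.append((client, set(columns), callback))

    def write(self, writer, row_id, values):
        """Write a row, then push changed monitored columns to other clients."""
        old = self.rows.get(row_id, {})
        changed = {c: v for c, v in values.items() if old.get(c) != v}
        self.rows[row_id] = {**old, **values}
        for client, columns, callback in self.monitors:
            update = {c: v for c, v in changed.items() if c in columns}
            # Assumption: the writer itself is not notified of its own update.
            if update and client != writer:
                callback(row_id, update)
```

With one monitor registered for the controller's client and one for the switch's client, a MAC address written by the switch reaches the controller's callback, and an address written by the controller reaches the switch's callback.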

In some embodiments, the database server 210 is used to write forwarding state to the database 220. The physical switch 125 is in a sense publishing its own forwarding tables in the database 220. As an example, the database client 215 on the switch software stack may update the database 220 with the MAC address of a machine that is connected to one of its ports. This would in turn cause the database server 210 to send a notification regarding the update to the client 205 on a network controller.

Similar to the physical switch 125, the network controller cluster 130 publishes addresses of other machines (e.g., virtual machines) that are connected to one or more software switches (not shown). When the client 205 on the network controller end makes an update to the database 220, the update would in turn generate a notification for the client 215 on the switch software stack 220. The client 215 may then read the update, and software on the stack may program a Content Addressable Memory (CAM) or Ternary CAM (TCAM) in the switch ASIC.

In the example described above, the network controller communicates with dedicated hardware, namely a hardware switch. FIG. 3 is a system diagram that illustrates the network controller cluster 130 communicating with both hardware and software switches. Specifically, this figure shows how the network controller cluster 130 exchanges management and configuration data (e.g., flows) with a software switch using the database protocol and the OpenFlow protocol, while exchanging management data and forwarding state with the physical switch 125 using the database protocol. In this example, the software switch 405 operates on a host 400. The software switch 405 has a database server 410, an OpenFlow agent 415, and a forwarding module 420. The software switch 405 of some embodiments is an Open Virtual Switch (OVS). In those embodiments, the OpenFlow agent 415 may be referred to as an OVS daemon, and the forwarding module 420 may be referred to as a kernel module.

In some embodiments, the host 400 includes hardware, a hypervisor, and one or more virtual machines (VMs). The hardware may include typical computer hardware, such as processing units, volatile memory (e.g., random access memory (RAM)), nonvolatile memory (e.g., hard disc drives, optical discs, etc.), network adapters, video adapters, or any other type of computer hardware. The hardware can also include one or more NICs, which are typical network interface controllers.

A hypervisor is a software abstraction layer that can run on top of the hardware of the host 400. There are different types of hypervisors, namely Type 1 (bare metal), which runs directly on the hardware of the host, and Type 2 (hosted), which runs on top of the host's operating system. The hypervisor handles various management tasks, such as memory management, processor scheduling, or any other operations for controlling the execution of the VMs. Moreover, the hypervisor communicates with the VMs to achieve various operations (e.g., setting priorities). In some embodiments, the hypervisor is a Xen hypervisor while, in other embodiments, the hypervisor may be any other type of hypervisor for providing hardware virtualization of the hardware on the host 400.

In some embodiments, the software switch 305 runs on a VM. The VM can be a unique virtual machine, which includes a modified Linux kernel (e.g., to include the OVS kernel module 325). The VM of such embodiments is responsible for managing and controlling other VMs running on the hypervisor. In some embodiments, the VM includes a user space and the OVS daemon runs as a background process in the user space.

The OVS daemon 320 is a component of the software switch 305 that makes switching decisions. On the other hand, the kernel module 325 receives the switching decisions, caches them, and uses them to process packets. For instance, when a packet comes in, the kernel module 325 first checks a datapath cache (not shown) to find a matching flow entry. If no matching entry is found, control is shifted to the OVS daemon 320. The OVS daemon 320 examines one or more flow tables to generate a flow to push down to the kernel module 325. In this manner, when any subsequent packet is received, the kernel module 325 can quickly process the packet using the cached flow entry. The kernel module 325 provides a fast path to process each packet. However, the switching decisions are ultimately made through the OVS daemon 320, in some embodiments.
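The division of labor between the daemon and the kernel module can be sketched as a cache in front of a slower decision function. The following is a minimal illustration only, not the OVS implementation; the class names, the header tuple, and the string actions are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical simplification: a packet is reduced to a header tuple,
# and an action is a plain string such as "output:2".
Header = Tuple[str, str]  # (source MAC, destination MAC)


class Daemon:
    """Slow path: consults a flow table to make the switching decision."""

    def __init__(self, flow_table: Dict[str, str]) -> None:
        self.flow_table = flow_table  # destination MAC -> action
        self.lookups = 0              # counts how often the slow path runs

    def decide(self, header: Header) -> str:
        self.lookups += 1
        return self.flow_table.get(header[1], "drop")


class KernelModule:
    """Fast path: caches decisions and reuses them for later packets."""

    def __init__(self, daemon: Daemon) -> None:
        self.daemon = daemon
        self.cache: Dict[Header, str] = {}

    def process(self, header: Header) -> str:
        action = self.cache.get(header)
        if action is None:
            # Cache miss: hand control to the daemon, then cache its decision.
            action = self.daemon.decide(header)
            self.cache[header] = action
        return action
```

In this sketch, only the first packet of a flow reaches the daemon; later packets with the same header are handled from the cache.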

A network controller uses the OpenFlow protocol to inspect and modify a set of one or more flow tables managed by the OVS daemon 320. The network controller cluster computes flows and pushes them to the software switch 305 through this OpenFlow channel. The network controller communicates with the software switch using the database protocol to create and manage overlay tunnels to transport nodes. The network controller might also use this protocol for discovery purposes (e.g., to discover which virtual machines are hosted at the hypervisor). The OVS daemon 320 also communicates with the database server 310 to access management data (e.g., bridge information, virtual interface information) stored in the database 330.

Unlike with the software switch, a network controller communicates with the physical switch 125 using only the database protocol. The database protocol is essentially used to control the physical switch 125. Through the database channel, the network controller reads the configuration of the physical switch (e.g., an inventory of its physical ports) and sends management data to the physical switch. For example, a network controller might send instructions to the physical switch to create tunnel ports for a logical switch. As mentioned above, the network controller cluster 130 exchanges forwarding state (e.g., L2 and/or L3 forwarding state) with the physical switch 125. That is, the network controller instructs the physical switch 125 to program its forwarding table using the database protocol.

FIG. 4 conceptually illustrates an architectural diagram of an example hardware switch (e.g., a third-party switch). As illustrated in this figure, the switch 125 includes ingress ports 405, egress ports 410, and forwarding tables 415. The switch also includes the database server 210, the database client 215, the SW Stack 220, and the switch ASIC 225, which are described above by reference to FIG. 2.

The ingress ports 405 conceptually represent a set of ports through which the switch 125 receives network data. The ingress ports 405 may include different numbers of ingress ports in different embodiments. As shown, the ingress ports 405 can receive network data that is external to the switch 125, which is indicated as incoming packets in this example. When a packet is received through an ingress port, the switch 125 may send the packet to the switch ASIC 225 so that the packet can be quickly processed.

The egress ports 410 conceptually represent a set of ports through which the switch 125 sends network data. The egress ports 410 may include different numbers of egress ports in different embodiments. In some embodiments, some or all of the egress ports 410 may overlap with some or all of the ingress ports 405. For instance, in some such embodiments, the set of ports of the egress ports 410 is the same set of ports as the set of ports of the ingress ports 405. As illustrated in FIG. 4, the egress ports 410 receive network data after the switch 125 processes the network data based on the forwarding tables 415. When the egress ports 410 receive network data (e.g., packets), the switch 125 sends the network data out of the egress ports 410, which is indicated as outgoing packets in this example, based on an entry in the forwarding tables 415.

In some embodiments, the forwarding tables 415 store active flow tables and/or flow entries that are used to determine operations for making switching decisions. In this example, each flow entry includes a qualifier and an action. The qualifier defines a set of fields to match against a set of packet header fields. As shown in FIG. 4, the flow entries are stored in memory. The memory can be random access memory (RAM) or some other type of memory such as Content Addressable Memory (CAM) or Ternary Content Addressable Memory (TCAM). For example, a vendor may design their Layer 2 switches with CAM for performing Layer 2 switching and/or with TCAM for performing Quality of Service (QoS) functions. The switch architecture may support the ability to perform multiple lookups into multiple distinct CAM and/or TCAM regions in parallel. The CAM and TCAM are examples of switching ASICs that some vendors' switches leverage for line-speed fast switching.
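As an illustration of a flow entry made of a qualifier and an action, the following sketch matches packet header fields against value/mask pairs, in the spirit of a ternary lookup. It is a software model under stated assumptions, not a description of any vendor's ASIC: the field names, the integer encoding of headers, and the priority field are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class FlowEntry:
    # Qualifier: field name -> (value, mask). A mask bit of 0 means "don't care".
    qualifier: Dict[str, Tuple[int, int]]
    action: str
    priority: int = 0  # assumption: ties between matches are broken by priority

    def matches(self, headers: Dict[str, int]) -> bool:
        for name, (value, mask) in self.qualifier.items():
            if name not in headers:
                return False
            if (headers[name] & mask) != (value & mask):
                return False
        return True


def lookup(table: List[FlowEntry], headers: Dict[str, int]) -> Optional[str]:
    """Return the action of the highest-priority matching entry, if any."""
    best: Optional[FlowEntry] = None
    for entry in table:
        if entry.matches(headers) and (best is None or entry.priority > best.priority):
            best = entry
    return best.action if best else None
```

A hardware TCAM evaluates all entries in parallel; the loop above only models the result of such a lookup, not its speed.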

As described above, an instance of the database server 210 controls access to the database 220. The database client 215 accesses the database 220 to read and write management data and forwarding state. In addition, a database client on the network controller accesses the database 220 to read and write management data and forwarding state. The database server 210 may send a notification to one database client (e.g., on the switch end) if the other database client (e.g., on the network controller end) updates a table or a subset of a table of the database 220.

One other distinction to note is that the hardware switch's model is more generalized than the software switch's model. In the software switch, the network controller has specific knowledge of how forwarding works, and takes advantage of it. On the other hand, the operations of the hardware switch can vary from one third-party vendor to another. Therefore, in the hardware model, the database is more abstract in that it contains the basic information to manage the hardware and exchange forwarding state.

II. Example Database Operations

Some embodiments implement a database server on dedicated hardware. In some embodiments, the network controller cluster uses a database protocol to access this database server. Several examples of such a database server will now be described in this section by reference to FIGS. 5-9. This section also describes various protocol messages used to communicate with the database server. In some embodiments, the database is an Open Virtual Switch (OVS) database and the protocol is an OVS database protocol.

A. Locking Operations

To prevent multiple clients from writing to the database at the same time, the database protocol may support various lock operations to lock or unlock the database. One reason for using such a feature is that it resolves conflicts between multiple writers. For example, two different controllers may each assume that they are the master and attempt to write to a database at the same time. This locking feature resolves that conflict by making each network controller receive permission before writing to the database. In some embodiments, each client must obtain a particular lock from the database server before the client can write to a certain table of the database. The database server will assign the client ownership of the lock as soon as it becomes available. When multiple clients request the same lock, they will receive it in first-come, first-served order. After receiving the lock, the client can then write to the database and release the lock by performing an unlock operation.
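The first-come, first-served lock handling described above can be sketched with a queue of waiting clients per lock. This is a minimal in-memory model for illustration; the class name, method names, and client labels are hypothetical and do not correspond to any particular database server.

```python
from collections import deque
from typing import Deque, Dict, Optional


class LockManager:
    """Grants each named lock to at most one owner, in request order."""

    def __init__(self) -> None:
        self.owner: Dict[str, str] = {}             # lock ID -> owning client
        self.waiters: Dict[str, Deque[str]] = {}    # lock ID -> queued clients

    def lock(self, lock_id: str, client: str) -> bool:
        """Return True if the client now owns the lock, False if it is queued."""
        if lock_id not in self.owner:
            self.owner[lock_id] = client
            return True
        self.waiters.setdefault(lock_id, deque()).append(client)
        return False

    def unlock(self, lock_id: str, client: str) -> Optional[str]:
        """Release the lock and return the next owner, if one was waiting."""
        if self.owner.get(lock_id) != client:
            raise ValueError(f"{client} does not own {lock_id}")
        queue = self.waiters.get(lock_id)
        if queue:
            next_owner = queue.popleft()
            self.owner[lock_id] = next_owner
            return next_owner
        del self.owner[lock_id]
        return None
```

The value returned by `unlock` identifies the client that a server would notify next that it has been granted the lock.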

FIG. 5 provides an illustrative example of how a database server 525, which executes on dedicated hardware 520, handles conflicts that arise when multiple network controllers 505 and 510 attempt to update a database 535 at the same time. Six operational stages 540-565 are shown in this figure. The dedicated hardware 520 includes the database server 525 and the database client 530. The database server 525 controls access to the hardware's database 535. The dedicated hardware 520 can be any device, ranging from switches to appliances such as firewalls, load balancers, etc.

To simplify the description, only two network controllers 505 and 510 are shown in this figure. However, there can be additional network controllers. Each network controller has a database client (515 or 570) that can communicate with the database server 525 to access the database 535. In some embodiments, each client (515 or 570) communicates with the database server 525 through the database communication protocol (e.g., a JavaScript Object Notation (JSON) remote procedure call (RPC)-based protocol).

The first stage 540 shows the client 570 on the network controller 505 making a first call (e.g., RPC) to the database server 525. The first call is a lock operation call. In some embodiments, the database server 525 will assign the client 570 ownership of the lock as soon as it becomes available. When multiple clients request the same lock, they will receive it on a first-come, first-served basis. The database server 525 of some embodiments supports an arbitrary number of locks, each of which is identified by an ID (e.g., a client-defined ID). In some embodiments, the precise usage of a lock is determined by the client (515 or 570). For example, the clients 515 and 530 may be programmed to agree that a certain table can only be written by the owner of a certain lock. That is, the database server 525 itself does not enforce any restrictions on how locks are used. The database server 525 simply ensures that a lock has at most one owner.

The second stage 545 assumes that the specified lock is available. Accordingly, the database server 525 sends a response. The response is a “locked” notification, which notifies the client 570 that it has been granted a lock it had previously requested with the “lock” method call. In some embodiments, the notification has the client-defined ID.

At the same second stage 545, the client 515 on the network controller 510 makes a call (e.g., RPC) to the database server 525 requesting the same lock (e.g., to modify the same table). However, the lock has already been assigned to the client 570 on the network controller 505. In some embodiments, the client 570 has to perform an unlock operation by sending an unlock request to the database server 525 with the lock ID. When the lock is released, the database server will then send the locked notification to the next client (e.g., the client 515 or 530) that is in line to use the lock. In some embodiments, the database protocol supports a call to steal a specified lock.
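For embodiments in which the protocol is the OVS database protocol, the lock-related calls could be encoded as JSON-RPC messages along the following lines. The sketch assumes the message layout of the OVSDB management protocol (RFC 7047); the helper function names and the lock ID used in any call are hypothetical, and other database protocols may use different message shapes.

```python
import json
from typing import Any, Dict


def lock_request(lock_id: str, request_id: int) -> Dict[str, Any]:
    """Ask for ownership of a lock; the server queues the request if it is taken."""
    return {"method": "lock", "params": [lock_id], "id": request_id}


def steal_request(lock_id: str, request_id: int) -> Dict[str, Any]:
    """Take ownership of a lock even if another client currently holds it."""
    return {"method": "steal", "params": [lock_id], "id": request_id}


def unlock_request(lock_id: str, request_id: int) -> Dict[str, Any]:
    """Release a lock so the next waiting client can be notified."""
    return {"method": "unlock", "params": [lock_id], "id": request_id}


def locked_notification(lock_id: str) -> Dict[str, Any]:
    """Server-to-client notice that a previously requested lock was granted."""
    # Notifications carry a null id because no reply is expected.
    return {"method": "locked", "params": [lock_id], "id": None}


def encode(message: Dict[str, Any]) -> str:
    """Serialize a message to the JSON text that would be sent on the wire."""
    return json.dumps(message, sort_keys=True)
```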

In the third stage 550, the ownership of the specified lock belongs to the client 570 on the network controller 505. The client 515 on the network controller 510 cannot access one or more tables of the database 535. The enforcement of this security policy, in some embodiments, is defined at the client 515. In other words, the database server 525 is not programmed to stop the client 515 from accessing one or more of the same tables, but the client 515 is programmed to stop itself. One of ordinary skill in the art would understand that the database server could be programmed to enforce its own security policies. For instance, the database server 525 can be implemented such that it locks a database table that is being accessed by a client.

The third stage 550 illustrates the client 570 on the network controller 505 exercising the ownership of the lock by sending to the database server 525 a transaction call (e.g., in order to insert or update a record in the database 535). The fourth stage 555 illustrates the database server 525 performing the transaction on the database 535 and returning a result of the transaction. The fifth stage 560 shows the client 570 on the network controller 505 releasing the specified lock. In some embodiments, the database server 525 sends a confirmation message (not shown) of releasing the lock to the database client 570 on the network controller 505. As the lock is released, the sixth stage 565 shows the ownership of the lock being transferred to the client 515 on the network controller 510.

In the example described above, two clients 570 and 515 on the network controllers 505 and 510 attempt to perform transactions on the same database table at the same time. In some embodiments, the client 530 on the dedicated hardware 520 must take ownership of the lock to write to a database table. That is, when the client (570 or 515) on the network controller (505 or 510) has ownership of the lock, the client 530 on the hardware 520 must wait until the lock becomes available.

B. Bi-Directional Asynchronous Notifications

In some embodiments, the database server supports bi-directional asynchronous notifications. For example, when there is an update to a database table, the database server sends a notification regarding the update to a client (e.g., executing on a network controller or on the hardware). The notification may include a copy of the table or a subset of the updated table (e.g., a record). The bi-directional asynchronous notification mechanism is different from polling the database server to determine if there are any changes to one or more tables. The asynchronous notification feature is particularly useful because each client executing on a network controller or on the hardware receives a message when a set of one or more columns of interest has changed.

The database protocol of some embodiments provides a mechanism to receive notifications (e.g., asynchronous notifications). In some embodiments, each client can use the database protocol's monitor or registration request to monitor a database. This request enables the client to replicate tables or subsets of tables within the database by requesting notifications of changes to those tables and by receiving the complete initial state of a table or a subset of a table. With this database call, the client can specify one or more table columns to monitor. The columns can be from any number of database tables. The client can also use the monitor cancel request to cancel a previously issued monitor request.

The client of some embodiments can specify different types of operations to monitor, such as initial, insert, delete, and modify. The initial parameter specifies that each and every row in the initial table be sent as part of the response to the monitor request. The insert parameter specifies that update notifications be sent for rows newly inserted into the table. The delete parameter specifies that update notifications be sent for rows deleted from the table. The modify parameter specifies that update notifications be sent whenever a row in the table is modified. When there is a change to any one of the monitored fields, the database server of some embodiments returns a table update object. This object may include previous row values if the row's values have been modified or deleted.
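The monitor and monitor cancel requests, and the per-row portion of a table update object, might look as follows when the protocol is the OVS database protocol. The message layout is assumed from the OVSDB management protocol (RFC 7047); the function names, and any database, table, and column names passed to them, are hypothetical placeholders.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def monitor_request(db: str, monitor_id: str, table: str, columns: List[str],
                    request_id: int, *, initial: bool = True, insert: bool = True,
                    delete: bool = True, modify: bool = True) -> Dict[str, Any]:
    """Build a request to monitor selected columns of one table."""
    select = {"initial": initial, "insert": insert, "delete": delete, "modify": modify}
    return {
        "method": "monitor",
        "params": [db, monitor_id, {table: [{"columns": columns, "select": select}]}],
        "id": request_id,
    }


def monitor_cancel_request(monitor_id: str, request_id: int) -> Dict[str, Any]:
    """Build a request to cancel a previously issued monitor request."""
    return {"method": "monitor_cancel", "params": [monitor_id], "id": request_id}


def row_update(old: Optional[Row], new: Optional[Row]) -> Dict[str, Row]:
    """Describe one changed row: "old" holds previous values, "new" the current row."""
    update: Dict[str, Row] = {}
    if old is not None:
        # For a modified row, report only the previous values that changed.
        update["old"] = old if new is None else {k: v for k, v in old.items() if new.get(k) != v}
    if new is not None:
        update["new"] = new
    return update
```

In this sketch an inserted row yields only "new", a deleted row only "old", and a modified row both.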

FIG. 6 illustrates an example of the client 530 on the dedicated hardware 520 being notified of an update to a database table. Four operational stages 605-620 are shown in this figure. The figure shows the network controller 505 communicating with the hardware 520. The network controller 505 includes the database client 570, and the hardware 520 includes the database client 530 and the database server 525.

The first stage 605 illustrates the client 570 on the network controller 505 sending a transaction request to the database server 525 on the hardware 520. The transaction call includes a request to insert a new record in a particular table. The client 530 on the hardware 520 has previously sent the database server 525 a request to monitor this table.

The second stage 610 shows the database server 525 on the hardware 520 performing the requested operation on the database 535. In particular, the database server 525 inserts a new record in the particular table. The third stage 615 shows the database server 525 sending the result of the transaction to the client 570 on the network controller 505. After performing the insert operation, the database server 525 sends (at stage 620) the update notification to the client 530 on the hardware 520.
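The four stages above can be modeled with a small in-memory server that stores registered monitors as callbacks. This is a sketch of the message ordering only, under the assumption that the result is returned before the update notification is sent; the class, method, and table names are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Notify = Callable[[str, Row], None]   # called with (table name, inserted row)
Reply = Callable[[Dict[str, str]], None]


class DatabaseServer:
    """Toy server: stores rows and notifies clients that monitor a table."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[Row]] = {}
        self.monitors: Dict[str, List[Notify]] = {}

    def monitor(self, table: str, notify: Notify) -> None:
        """Register a client callback for changes to one table."""
        self.monitors.setdefault(table, []).append(notify)

    def transact_insert(self, table: str, row: Row, reply: Reply) -> None:
        # Perform the requested operation on the database.
        self.tables.setdefault(table, []).append(dict(row))
        # Send the result of the transaction to the requesting client.
        reply({"result": "ok"})
        # Send the update notification to every client monitoring the table.
        for notify in self.monitors.get(table, []):
            notify(table, dict(row))
```

The same model covers both directions: either the controller's client or the hardware's client can be the writer, with the other registered as the monitor.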

One of the main uses of the contents of the hardware's database is to exchange forwarding tables (e.g., L2 forwarding tables) between the network controller cluster and the hardware switch. The physical switch publishes its own forwarding tables in the database. One way this can happen is that the database client on the physical switch makes updates to the database contents. This would in turn generate a notification to a client on a network controller.

In some embodiments, the protocol's update call is used to exchange forwarding state. For instance, when the network controller 505 is notified of an update to a software switch, the network controller's client accesses the database server 525 executing on the hardware 520 to write (e.g., a MAC address of a VM) to a database table. This will in turn cause the database server 525 to push the update to the database client 530 executing on the hardware 520. The hardware switch's software stack can then translate the update to a flow that it understands to process packets. An example of pushing such addresses to the hardware 520 will be described below by reference to FIG. 15. The update call can also be used for management purposes, such as setting up tunnel ports for the hardware 520. An example of pushing management data to the hardware 520 will be described below by reference to FIG. 13.

FIG. 7 illustrates an example of the client 570 on the network controller 505 being notified of an update to a database table. This figure is similar to the previous figure. However, in this example, the client 530 on the hardware 520 is inserting a new record in the database 535, and the client 570 on the network controller 505 is being notified of the update.

Four operational stages 705-720 are shown in this figure. The first stage 705 illustrates the client 530 on the hardware 520 sending a transaction request to the database server 525. The transaction call includes a request to insert a new record in a particular table. The client 570 on the network controller 505 has previously sent the database server 525 a request to monitor this table.

The second stage 710 shows the database server 525 on the hardware 520 performing the requested operation on the database 535. In particular, the database server 525 inserts a new record in the particular table. The third stage 715 shows the database server 525 sending the result of the transaction to the client 530. After performing the insert operation, the database server 525 sends (at stage 720) the update to the client 570 on the network controller 505.

In some embodiments, the protocol's update call is used to exchange forwarding state. For instance, if the hardware 520 is a switch, it can publish its forwarding state by having the database client 530 write (e.g., a learned MAC address) to a database table. This will in turn cause the database server 525 to push the update to the database client 570 executing on the network controller 505. The network controller 505 can then notify other network elements (e.g., the software switches, hardware switches) regarding the update. An example of publishing addresses of machines attached to a switch's ports will be described below by reference to FIG. 14. The update call can also be used for management purposes, such as reading an inventory of physical ports on the hardware 520, setting up tunnel ports, etc.

C. Handling Transactions

In some embodiments, the database server is designed to deal with any errors in completing transactions. As an example, for a given transaction, the database server of some embodiments executes a set of operations on a database. When there are multiple operations, the database server may execute each operation in a specified order, except that if an operation fails, then the remaining operations are not executed. The set of operations is executed as a single atomic, consistent, isolated transaction. The transaction is committed only if each and every operation succeeds. In this manner, the database server provides a means of transacting with a client that maintains the reliability of the database even when there is a failure in completing one of the operations in the set of operations.

FIG. 8 conceptually illustrates a process 800 that some embodiments perform on a given transaction. The process 800 of some embodiments is performed by a database server that executes on dedicated equipment. As shown, the process 800 begins when it receives (at 805) a transaction that includes a set of operations to perform on a database. The operations can be different types of operations, such as insert, delete, update, etc. In some embodiments, each transaction can include different combinations of different types of operations (e.g., insert and update, delete and insert, etc.).

At 810, the process 800 performs the next operation on the database. The process 800 then determines (at 815) whether the operation on the database has been successfully completed or has failed. When the operation has failed, the process 800 specifies (at 820) that there was an error performing the operation. The process then proceeds to 840, which is described below.

When the operation has not failed, the process 800 specifies (at 825) that the operation has been performed successfully. The process 800 then determines (at 830) whether there is another operation to perform. As stated above, the transaction can include more than one operation to perform on the database. If there is another operation, the process 800 returns to 810, which is described above. If there is no other operation, the process 800 commits (at 835) the transaction. Here, the process 800 is committing the transaction because each and every operation has been successfully performed.

At 840, the process 800 sends the result of the transaction to the client. If the transaction has been committed, the result includes, for each operation, an indication that the operation has been successfully performed or an indication that an operation has failed. These indications are based on what the process 800 specified at operations 820 and 825, which are described above. The indication's value can simply be a Boolean value of zero or one, which indicates failure or success, respectively, or vice versa.
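The flow of the process 800 can be sketched as a loop that stops at the first failed operation and commits only when every operation succeeds. The sketch below is one possible realization, not the described embodiment: it assumes that atomicity is achieved by applying the operations to a working copy, and the `insert` helper, the table layout, and the return shape are hypothetical.

```python
import copy
from typing import Any, Callable, Dict, List, Tuple

Database = Dict[str, Dict[str, Any]]     # table name -> key -> row
Operation = Callable[[Database], None]   # raises an exception on failure


def run_transaction(db: Database, operations: List[Operation]) -> Tuple[bool, List[bool]]:
    """Apply operations in order; return (committed, per-operation results)."""
    working = copy.deepcopy(db)          # changes stay invisible until commit
    results: List[bool] = []
    for operation in operations:
        try:
            operation(working)           # perform the next operation
        except Exception:
            results.append(False)        # record the error and skip the rest
            return False, results
        results.append(True)             # record the success
    db.clear()
    db.update(working)                   # commit: every operation succeeded
    return True, results


def insert(table: str, key: str, row: Any) -> Operation:
    """Hypothetical helper: an insert that fails if the key already exists."""
    def operation(db: Database) -> None:
        if key in db.setdefault(table, {}):
            raise KeyError(key)
        db[table][key] = row
    return operation
```

Because a failed transaction returns before the commit step, the database keeps its earlier contents, and the result list ends at the failed operation.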

Some embodiments perform variations on the process 800. The specific operations of the process 800 may not be performed in the exact order shown and described. The specific operations may not be performed in one continuous series of operations, and different specific operations may be performed in different embodiments. Furthermore, the process 800 could be implemented using several sub-processes, or as part of a larger macro process.

D. Garbage Collection

In some embodiments, the database server, which is installed on hardware, performs a garbage collection operation on a database. Garbage collection is the process of maintaining the database by removing data in a database that is no longer used or is outdated. In some embodiments, the garbage collection is performed at runtime (e.g., with each given transaction), performed periodically (e.g., at a set interval as a background task), or performed when triggered (e.g., when there is a change to the database or a change to a particular table of the database).

In some embodiments, if a table entry (e.g., a record) is not part of a root set and the entry has no reference, or no reference of a particular importance (e.g., a strong reference), then the table entry is subject to garbage collection. The database server of some embodiments allows a table to be defined as a root table. If the table is defined as a root table, the table's entries exist independent of any references. The entries in the root table can be thought of as part of the “root set” in a garbage collector. In referencing other entries, the database server of some embodiments allows each entry to refer to another entry in the same database table or a different database table. To uniquely identify different entries, each entry may be associated with a unique identifier, such as a primary key, Universally Unique Identifier (UUID), etc.

The garbage collection process prevents the database from continually growing in size over time with unused data. For instance, a Top of Rack (TOR) switch may have a database that stores data relating to logical switches. The logical switch information can include the addresses (e.g., IP addresses) of the tunnel endpoints that instantiate the logical switch and other supplemental data. In such cases, the garbage collection feature allows those addresses and supplemental data to be deleted from the database when the corresponding logical switch is deleted.

FIG. 9 conceptually illustrates an example of performing a garbage collection operation on a database 900. Specifically, this figure shows several entries being deleted from different tables 910 and 915 when an entry is deleted from a root table 905. Five operational stages 920-940 of the database 900 are shown in this figure.

The first stage 920 shows the database 900 prior to deleting an entry from the root table 905. To simplify the discussion, each of the tables 905-915 shows only three entries. Specifically, the root table 905 has three root entries. The table 910 has three entries, with the first two entries referencing the first entry of the root table 905 and the third entry referencing the third entry of the same root table. The table 915 has three entries that all reference the first entry of the table 910. In some embodiments, the tables 910 and 915 represent child tables. However, as mentioned above, an entry in a table may reference another entry in the same table. Thus, a child entry can be in the same table as a parent entry.

The second stage 925 illustrates the first entry being deleted from the root table 905. The third stage 930 shows a garbage collection process being performed on the database 900. In particular, the third stage 930 shows two entries in the table 910 being deleted as a result of the first entry being deleted from the root table 905. In some embodiments, the two entries are deleted from the table 910 because they are not part of a root set and/or each of the entries has no references of a particular value (e.g., a strong reference). This means that the references in the table 915 to the first entry in the table 910 are not references of a particular importance. If the references were of strong importance, the first entry in the table 910 would not be subject to garbage collection, in some embodiments.

The fourth stage 935 shows the continuation of the garbage collection process. Here, the process deletes all the entries from the table 915. These entries are deleted from the table 915 because they are not part of a root set and are not referenced in a particular manner by any child or leaf entries. The fifth stage 940 shows the result of the garbage collection process. Specifically, it shows the remaining entries after the garbage collection process, which are two entries in the root table 905 and one entry in the table 910.
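A root-set garbage collector of the general kind discussed in this section can be sketched as a reachability pass. The model below is an assumption made for illustration: entries are named "table:key", entries in root tables are always kept, and a strong reference is modeled as an edge from a surviving entry to an entry it keeps alive. That edge direction is a simplification chosen for the sketch and is not taken from FIG. 9.

```python
from typing import Dict, List, Set


def collect_garbage(entries: Set[str], root_tables: Set[str],
                    strong_refs: Dict[str, List[str]]) -> Set[str]:
    """Return the entries that survive garbage collection.

    entries:     IDs of the form "table:key" for every entry in the database.
    root_tables: names of tables whose entries exist independent of references.
    strong_refs: entry ID -> IDs of the entries it strongly references.
    """
    # Entries of root tables form the root set.
    live = {e for e in entries if e.split(":", 1)[0] in root_tables}
    pending = list(live)
    while pending:
        current = pending.pop()
        for target in strong_refs.get(current, []):
            if target in entries and target not in live:
                live.add(target)       # kept alive by a surviving entry
                pending.append(target)
    return live
```

With three root entries, three entries in a second table, and three in a third, deleting the first root entry in this model leaves two root entries and one entry in the second table, which is the same final count as in the fifth stage above.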

III. Defining Virtual Networks

The network controller cluster of some embodiments can define a virtual network through software and hardware forwarding elements. That is, the network controller cluster provides interfaces for configuration management and asynchronous notification, which can be used to create a single logical switch across multiple software and hardware forwarding elements. These forwarding elements can be at one location (e.g., at one datacenter) or at multiple different locations (e.g., different datacenters). Several examples of such virtual networks will now be described below by reference to FIGS. 10-13.

A. Creating Tunnels

FIG. 10 illustrates an example of a controller cluster that communicates with different devices to create tunnels. The devices include a top of rack (TOR) switch 1025, which is a hardware switch, and two network hypervisors 1015 and 1020. Each hypervisor runs an instance of a virtual or software switch, such as Open Virtual Switch (OVS). In this example, the devices may be operating at different locations (e.g., at different datacenters).

Two operational stages 1005 and 1010 are shown in FIG. 10. The first stage 1005 illustrates an example control plane implementation that defines tunnels between the devices 1015, 1020, and 1025. The tunnels extend the virtual network overlay among the hypervisors 1015 and 1020 and the TOR switch 1025. In this example, the network controller cluster 130 exchanges management data with the hypervisors 1015 and 1020 using the database protocol. The management data includes instructions on how each device should create a tunnel between itself and another device. For instance, each tunnel endpoint may receive addresses (e.g., IP addresses) of other tunnel endpoints, addresses of service nodes, virtual network identifiers (VNIs) to use for tunnels, etc. In some embodiments, the VNI is also referred to as a logical forwarding element identifier (LFEI). This is because the identifier is used to effectuate the creation of the logical forwarding element from several managed edge forwarding elements. That is, two or more managed edge forwarding elements use the same VNI to create a separate broadcast domain for a set of machines attached to their virtual interfaces or ports.

In some embodiments, the network controller cluster 130 computes and pushes flows to the hypervisors 1015 and 1020 through the OpenFlow channel. The network controller cluster 130 exchanges management data with the TOR switch 1025 using the database protocol. The database protocol is also used by the controller cluster 130 and the TOR switch 1025 to exchange forwarding state.

The second stage 1010 shows an example data plane view of the overlay network that is established through the controller cluster 130 and the devices 1015-1025. In particular, the network controller cluster 130 has pushed management data to establish a tunnel between the two hypervisors 1015 and 1020, between the hypervisor 1015 and the TOR switch 1025, and between the hypervisor 1020 and the TOR switch 1025.
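The management data for such a tunnel mesh can be pictured as a per-device list of remote endpoints that share one VNI. The following sketch only illustrates how a controller might derive that list for a full mesh; the function name, the device labels, the dictionary keys, and the IP addresses in any call are hypothetical and are not drawn from an actual database schema.

```python
from itertools import combinations
from typing import Dict, List


def tunnel_configs(endpoints: Dict[str, str], vni: int,
                   tunnel_type: str = "vxlan") -> Dict[str, List[dict]]:
    """Return, per device, the tunnels it should create to every other endpoint.

    endpoints maps a device name to the IP address of its tunnel endpoint.
    """
    configs: Dict[str, List[dict]] = {name: [] for name in endpoints}
    # One tunnel per unordered pair of devices; each side learns the other's IP.
    for a, b in combinations(sorted(endpoints), 2):
        configs[a].append({"remote_ip": endpoints[b], "vni": vni, "type": tunnel_type})
        configs[b].append({"remote_ip": endpoints[a], "vni": vni, "type": tunnel_type})
    return configs
```

For the three devices in FIG. 10, this yields three tunnels in total, with each device configured for two of them.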

The second stage 1010 shows Virtual Extensible LAN (VXLAN) tunnels. However, other types of tunnels can be established depending on the encapsulation protocol that is used. As an example, the network controller cluster 130 can be configured to create different types of tunnels such as Stateless Transport Tunneling (STT) tunnels, Network Virtualization using Generic Routing Encapsulation (NVGRE) tunnels, etc.

The previous example illustrated tunnels being established between several edge devices. In some embodiments, the tunnels are established to create virtual private networks (VPNs) for different entities (e.g., clients, customers, tenants). The network controller cluster of some embodiments implements multiple logical switching elements across different types of switching elements, including software and hardware switches. This allows multiple different logical forwarding elements to be implemented across any combination of software and/or hardware switches.

B. Logical Forwarding Elements

FIGS. 11 and 12 illustrate an example of defining two separate logical switches. Specifically, FIG. 11 shows an example physical topology with several tunnel endpoints, while FIG. 12 shows an example logical topology created from the physical topology through the control of the network controller cluster.

As shown in FIG. 11, the TOR switch 1025 has two ports that are connected to machines associated with two different entities. Specifically, port 1, which is paired with VLAN 1, is connected to a machine (e.g., computer) that is associated with a first entity (e.g., a client, customer, or tenant). Port 2, which is paired with VLAN 2, is connected to another machine that is associated with a second entity. The figure conceptually shows that two ports are connected to different machines for different entities as port 1 has a lighter shade of gray than port 2.

In relation to the port and VLAN pairing, when a hardware switch receives traffic from a forwarding element (e.g., a hypervisor), the hardware switch outputs that packet on its physical port. At that point, the hardware switch might be trunking multiple VLANs. In some embodiments, the system (e.g., the network controller cluster) keeps track of the VLANs and their mapping to the VNI. For example, there might be a logical switch 1 with a VNI 1, and VLAN 1 on port 1 is bridged to that logical switch. This means that when the physical switch receives a packet for VNI 1, it will output the packet on its port 1, using VLAN 1. When a machine (e.g., a server computer) is directly attached to an L2 hardware switch, there is no L3 tunnel established between the switch and the machine. So, instead of the VNI, the L2 VLAN is used to distinguish traffic for different entities (e.g., tenants, customers, etc.). In other words, this is essentially a translation or mapping mechanism from one side to the other. On one side, there is the tunnel IP fabric, where the VNI is used, and on the other side, there is the physical L2 fabric, where the VNI cannot be used because the fabric connects directly to the machine.
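The translation between the two sides can be pictured as a two-way lookup table. The following is a minimal Python sketch under stated assumptions: the class name VlanVniBridge and its methods are hypothetical and are not part of any described embodiment, and the bindings reuse the port, VLAN, and VNI numbers of FIG. 11.

```python
from typing import Dict, Optional, Set, Tuple

PortVlan = Tuple[int, int]  # (physical port number, VLAN ID)


class VlanVniBridge:
    """Hypothetical table translating between the L2 side and the tunnel side."""

    def __init__(self) -> None:
        self._vni_by_port_vlan: Dict[PortVlan, int] = {}
        self._port_vlans_by_vni: Dict[int, Set[PortVlan]] = {}

    def bind(self, port: int, vlan: int, vni: int) -> None:
        # Record that traffic on (port, vlan) belongs to logical switch `vni`.
        # Rebinding an already bound pair is not handled in this sketch.
        self._vni_by_port_vlan[(port, vlan)] = vni
        self._port_vlans_by_vni.setdefault(vni, set()).add((port, vlan))

    def vni_for(self, port: int, vlan: int) -> Optional[int]:
        # L2 side -> tunnel side: which VNI to encapsulate with.
        return self._vni_by_port_vlan.get((port, vlan))

    def port_vlans_for(self, vni: int) -> Set[PortVlan]:
        # Tunnel side -> L2 side: where to output a decapsulated packet.
        return set(self._port_vlans_by_vni.get(vni, set()))


bridge = VlanVniBridge()
bridge.bind(port=1, vlan=1, vni=1)  # first entity
bridge.bind(port=2, vlan=2, vni=2)  # second entity
```

A packet arriving over a tunnel with VNI 1 would be looked up with port_vlans_for(1), while a packet arriving on port 2 with VLAN 2 would be looked up with vni_for(2, 2).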

Similar to the TOR switch 1025, each hypervisor (1015 or 1020) is communicatively coupled to machines that are associated with different entities. This is shown by the virtual interfaces (VIFs) on the hypervisors 1015 and 1020. In some embodiments, a VIF is an interface that connects a virtual machine (VM) to a software switch. The software switch provides connectivity between the VIFs and the hypervisor's physical interfaces (PIFs), and handles traffic between VIFs collocated on the same hypervisor.

VIF1 on hypervisor 1015 and VIF2 on hypervisor 1020 are each connected to a VM associated with the first entity. As mentioned above, port 1 of the TOR switch 1025 is also connected to a machine associated with this first entity. On the other hand, VIF3 on hypervisor 1015 and VIF4 on hypervisor 1020 are connected to VMs of the second entity. Port 2 of the TOR switch 1025 is connected to a machine associated with this second entity. FIG. 11 conceptually shows that VIFs are connected to different VMs for different entities as the VIFs for the first entity's VMs have a lighter shade of gray than the VIFs for the second entity's VMs.

The network controller cluster of some embodiments creates logical switches 1205 and 1210 by assigning different virtual network identifiers (VNIs) for different entities (e.g., clients, customers, tenants). This is shown in FIG. 12 because each of the logical switches is assigned a unique VNI. Specifically, the first entity is assigned a VNI of 1, and the second entity is assigned a VNI of 2.

Network traffic between the separate tunnel endpoints 1015, 1020, and 1025 is encapsulated with the two VNIs. The encapsulation creates two separate broadcast domains for the machines of the two entities. Accordingly, the logical switch 1205 has logical ports that are associated with the machines associated with the first entity, and the logical switch 1210 has logical ports that are associated with the machines associated with the second entity.

As shown in FIG. 12, the logical switch 1205 has a logical port 1, which is associated with VIF1 of the hypervisor 1015, logical port 2, which is associated with VIF2 of the hypervisor 1020, and a logical tunnel port 1, which is associated with the port 1 and VLAN 1 pair of the TOR switch 1025. These VIFs and the port and VLAN pair are all connected to the machines associated with the first entity. In the logical abstraction, there are several logical ports. In some embodiments, each port has to have an attachment to send or receive traffic. The attachment is shown adjacent to the port number. For example, port 1 and VLAN 1 of the physical switch are attached to the logical tunnel port (LTP). Also, the names of the ports are symbolic names. In the system, each port may be identified by a universally unique identifier (UUID), and that identifier is unique for each port.

On the other hand, the logical switch 1210 has a logical port 3, which is associated with VIF3 of the hypervisor 1015, logical port 4, which is associated with VIF4 of the hypervisor 1020, and a logical tunnel port 2, which is associated with the port 2 and VLAN 2 pair of the TOR switch 1025. These VIFs and the port and VLAN pair are all connected to the machines associated with the second entity.

C. Example Process for Configuring Hardware

As mentioned above, some embodiments connect a wide range of third party devices, ranging from top-of-rack switches to appliances such as firewalls, load balancers, etc., to managed virtual networks. In the data plane, tunnels are used to extend the virtual network overlay among the hypervisors and third party devices. A variety of tunneling encapsulations may be used (VXLAN, STT, GRE, etc.). In the control plane, the one or more controllers provide sufficient information to a device (e.g., a third party device) to enable it to connect a “service” to a virtual network. The definition of a service in this context is dependent on the third party device. For example, it could be a firewall instance, or a VLAN to be bridged into a virtual network by a TOR switch.

In some embodiments, the exchange of information takes place using a database that resides in the device (e.g., third party device). In the case where a physical switch is to be connected to a managed virtual network, the switch provides an inventory of its physical ports to a set of network controllers via a database protocol. This provides the set of network controllers with information that allows it to manage physical ports much as it does virtual ports. In other words, physical ports can be attached to virtual networks, policies can be attached to the ports, etc.

When connecting higher-level services (e.g., firewalls, load balancers, etc.) to virtual networks, different mechanisms can be used to create a binding between a service instance and a virtual network. For example, the management interface to the device (e.g., the third party device) could provide this capability, using a universally unique identifier (UUID) of the virtual network (provided by the set of network controllers using the database protocol described herein) to perform the mapping.

From a data plane perspective, a device (e.g., third party box) needs to perform at least some of the following: (1) receive packets on a tunnel from a virtual switch (e.g., OVS) instance and forward them into the appropriate service context or port on the device; (2) forward packets from a given context or port to the appropriate virtual switch instance over a tunnel (in some cases, the appropriate virtual switch instance may be in a service node); (3) apply the appropriate tunnel encapsulation when forwarding packets to virtual switch instances (this includes both tunnel type (VXLAN, GRE, etc.) and appropriate keys, VNIs, etc.); (4) in some cases, replicate packets to multiple virtual instances; and (5) in some cases, perform some data plane learning.

The database protocol described herein enables the hardware to perform these data plane functions. It also allows information to be fed by the hardware to the set of network controllers to allow the correct forwarding of packets from managed virtual networks to third party devices. Notably, a hardware switch (TOR switch) can notify the set of network controllers of the addresses (MAC addresses) that have been learned on its physical ports so that the set of network controllers can facilitate forwarding packets destined for those addresses toward the hardware switch.

The database server provides a means to exchange information between a device (e.g., third party device) and the set of network controllers. The database server resides on the device, and has multiple clients: a local client, and a remote client (e.g., a network controller). The network controller and the third party device exchange information by reading and writing to the common database resident on the device. For example, a device can notify the network controller that it has a certain set of physical ports available by writing to the database, which is then read by the network controller. Similarly, one or more network controllers can instruct a device to connect a physical port to a particular logical network by writing an entry into a table containing such mappings.

FIG. 13 conceptually illustrates a process 1300 that some embodiments perform to exchange management data with such a device. In some embodiments, the process 1300 is performed by a set of one or more network controllers. The process 1300 begins when it communicates with a database server executing on the device. As shown, the process 1300 reads (at 1305) an inventory of physical ports from the database. The process 1300 then writes (at 1310) to the database a virtual network identifier (VNI) for identifying a logical switch. For example, with one hypervisor and one TOR switch, there can be multiple different logical networks but only a single tunnel between the hypervisor and the TOR switch. So, the VNI is used to identify one of the different logical networks when sending traffic between two forwarding elements (e.g., between the hypervisor and the TOR switch, between two hypervisors, between two TOR switches, etc.).

Accordingly, the VNI is a network identifier. Depending on the protocol, this is typically a fixed-width field (e.g., 24 bits) that allows for a certain number (e.g., about 16 million) of logical networks. In some embodiments, each VNI represents a virtual Layer-2 broadcast domain and routes can be configured for communication based on the VNI. Different protocols may give this identifier different names, such as tenant network identifier, virtual private network identifier, etc.
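As a worked example of the figures above, a 24-bit identifier yields 2^24 = 16,777,216 distinct values. The following Python sketch shows how such an identifier could be carried in an 8-byte VXLAN-style header (an 8-bit flags field, 24 reserved bits, the 24-bit VNI, and 8 reserved bits). The function names are hypothetical, and other encapsulation protocols lay out their identifiers differently.

```python
import struct

VNI_BITS = 24
MAX_VNIS = 1 << VNI_BITS  # number of distinct 24-bit identifiers


def pack_vxlan_header(vni: int) -> bytes:
    """Build an 8-byte VXLAN-style header carrying the given VNI."""
    if not 0 <= vni < MAX_VNIS:
        raise ValueError("VNI does not fit in 24 bits")
    flags = 0x08  # flag bit indicating that the VNI field is valid
    # Word 1: flags (8 bits) followed by 24 reserved bits.
    # Word 2: VNI (24 bits) followed by 8 reserved bits.
    return struct.pack("!II", flags << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Extract the VNI from the first 8 bytes of a VXLAN-style header."""
    _, word2 = struct.unpack("!II", header[:8])
    return word2 >> 8


header = pack_vxlan_header(2)
```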

At 1315, the process, for a physical port of the hardware, writes to the database a mapping that binds a VLAN to the logical switch. The process can perform these operations for multiple physical ports. Accordingly, the process 1300 determines (at 1320) whether to do additional binding. If so, the process returns to 1315, which is described above. Otherwise, the process 1300 proceeds to 1325. The process 1300 writes (at 1325) to the database addresses (e.g., IP addresses) of a set of tunnels that use the same VNI of the logical switch. The process 1300 then ends.

Some embodiments perform variations on the process 1300. The specific operations of the process 1300 may not be performed in the exact order shown and described. For example, the process might read the inventory of physical ports with one transaction and later perform a number of other transactions to write to the database. The specific operations may not be performed in one continuous series of operations, and different specific operations may be performed in different embodiments. As an example, the process might iteratively perform the operations multiple times to connect the hardware to another virtual network. Furthermore, the process 1300 could be implemented using several sub-processes, or as part of a larger macro process.
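The operations 1305-1325 can be sketched in Python as follows. This is only an illustration under stated assumptions: the device database is modeled as an in-memory dictionary, and the table names (physical_ports, logical_switches, vlan_bindings, tunnel_endpoints) and function names are hypothetical rather than taken from any actual database schema.

```python
from typing import Dict, Iterable, Set, Tuple


def new_device_db(ports: Iterable[int]) -> Dict[str, object]:
    """Hypothetical stand-in for the database that resides on the device."""
    return {
        "physical_ports": list(ports),  # written by the device's local client
        "logical_switches": {},         # logical switch name -> VNI
        "vlan_bindings": {},            # (port, VLAN) -> logical switch name
        "tunnel_endpoints": {},         # VNI -> list of tunnel endpoint IPs
    }


def process_1300(db, switch_name: str, vni: int,
                 bindings: Iterable[Tuple[int, int]],
                 tunnel_ips: Iterable[str]) -> Set[int]:
    # 1305: read the inventory of physical ports from the database.
    inventory = set(db["physical_ports"])
    # 1310: write the VNI that identifies the logical switch.
    db["logical_switches"][switch_name] = vni
    # 1315/1320: bind a VLAN on a physical port to the logical switch,
    # repeating for as long as there are additional bindings.
    for port, vlan in bindings:
        if port not in inventory:
            raise ValueError(f"port {port} is not in the device inventory")
        db["vlan_bindings"][(port, vlan)] = switch_name
    # 1325: write the addresses of the tunnels that use this VNI.
    db["tunnel_endpoints"][vni] = list(tunnel_ips)
    return inventory


db = new_device_db(ports=[1, 2])
process_1300(db, "ls1", vni=1, bindings=[(1, 1)],
             tunnel_ips=["10.0.0.15", "10.0.0.20"])
```

The IP addresses in the usage lines are placeholders. Calling process_1300 a second time with a different switch name and VNI corresponds to the variation in which the operations are repeated to connect the hardware to another virtual network.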

Having described an example process, example applications of the process will now be described in the following sub-sections.

1. Connecting Physical Workloads to a Virtual Network

The following sub-section provides an example of using the process 1300 to connect some physical servers (e.g., desktop computers) to a virtual network. In this example, a physical switch (e.g., a VXLAN-capable L2 switch) provides the gateway function between the physical servers and the virtual network. The physical switch runs an instance of a database server. One client of the database server is local to the switch, and at least one other client runs inside a network controller.

In some embodiments, initial configuration of the VXLAN gateway is performed using out-of-band mechanisms, such as the command line interface (CLI) of the gateway. Such configuration would include the configuration of physical ports, providing an IP address of the network controller, etc., but no virtual network provisioning. The local database client will populate the database with this configured information, making it available to the network controller.

The creation of virtual networks is handled by protocol calls to the network controller. To simplify the discussion, it is assumed that the virtual network is a VXLAN. Further protocol calls will request that the network controller connect physical ports, or 802.1q-tagged ports, to the VXLAN. The network controller of some embodiments then writes the following information into the database on the physical switch.

-   The VNI for the VXLAN
-   The set of tunnels that instantiate the VXLAN. As mentioned above, each of the tunnels may be identified by the IP address of the endpoint on which it terminates.
-   A set of mappings of <port, 802.1q tag> pairs to the logical switch (in some embodiments, the tag may be omitted if the entire physical port is to be mapped to the logical switch).

At this point, the VXLAN exists, and a set of (possibly tagged) physical ports is mapped to the VXLAN. Further protocol calls to the network controller are used to connect VMs to the VXLAN. To enable the physical switch to send packets to those VMs, the network controller writes MAC addresses of the VMs to the database on the physical switch. For traffic in the other direction, from VMs towards the physical switch, the client in the physical switch also writes learned MAC addresses of physical servers connected to its ports. Examples of such operations will be described below by reference to FIGS. 14 and 15. In some cases, it may also be necessary to deal with unknown MAC addresses, at least for packets destined to the physical world. This is handled as a special case of multicast, with the “unknown” MAC address pointing to a set of one or more tunnels. An example of such a special case will be described below by reference to FIG. 19.
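The handling of known and “unknown” MAC addresses can be illustrated with a small lookup routine. The following Python sketch is written under stated assumptions: the function name, the table layout, and the MAC and IP addresses are hypothetical placeholders used only to show the fallback behavior.

```python
from typing import Dict, List


def tunnels_for(known_macs: Dict[str, str],
                unknown_tunnels: List[str],
                dst_mac: str) -> List[str]:
    """Return the tunnel endpoint IP(s) to which a packet should be sent.

    known_macs maps a MAC address to the IP address of the tunnel endpoint
    behind which that address sits. unknown_tunnels is the set of tunnels
    that the "unknown" MAC address points to.
    """
    if dst_mac in known_macs:
        # Known unicast: exactly one tunnel.
        return [known_macs[dst_mac]]
    # Unknown destination: treated as a special case of multicast.
    return list(unknown_tunnels)


known = {"00:00:00:00:00:04": "10.0.0.20"}  # e.g., a VM behind a hypervisor
unknown = ["10.0.0.5"]                      # e.g., a service node
```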

In the example described above, the VXLAN successfully emulates an L2 switch connecting the VMs to the gateway, with no flooding of traffic taking place within the VXLAN to discover the location of MAC addresses. In some embodiments, Address Resolution Protocol (ARP) requests may be flooded via replication at a service node.

2. Connecting a Higher Level Service to a Virtual Network

The following sub-section provides an example of using a variation on the process 1300 to connect a device providing higher-level services, such as a firewall or a load-balancer, to a logical network that is managed by the controller cluster. In this case, the sequence of operations is slightly simplified compared to the preceding example.

The initial configuration of the device (e.g., third party device) may vary from one device to another. In some embodiments, the initial configuration includes providing the device with an IP address of the network controller (e.g., through the device's CLI or a management console). The device runs an instance of the database server. One client of the database server is local to the device, and at least one other client runs inside a network controller.

The client in the device writes to the database one or more IP addresses on which it can terminate VXLAN tunnels. At this point, the network controller is able to create logical networks that connect to the device.

Logical network creation is triggered by a database protocol call to the network controller. The network controller of some embodiments then writes the following information into the database on the device.

-   The VNI for the VXLAN
-   The set of tunnels that instantiate the VXLAN. As mentioned above, each of the tunnels may be identified by the IP address of the endpoint on which it terminates.
-   A set of MAC to physical locator bindings, representing the tunnels that should be used to reach machines (e.g., VMs) that are part of this logical network.

In some embodiments, the network controller provides mechanisms to bind an instance of a service provided by the device (e.g., third party device) to the logical network. For example, the device CLI could be used to create an instance of a service and bind it to a particular logical network. In some embodiments, each logical network in the database has a unique name, independent of the tunneling mechanism that is used to create the logical network. This name should be used for bindings rather than the VXLAN-specific VNI.

In some embodiments, the device writes the MAC address of the service in the MAC to physical locator entries for the logical switch. If the service is implemented in a VM, this would be the MAC address of the virtual network interface card (VNIC). In some embodiments, ARP requests are handled by replication at a service node.

IV. Exchanging Forwarding State with Hardware through Asynchronous Notifications

The previous section described examples of configuring hardware to add the hardware to a managed virtual network. Once the hardware is configured, the network controller of some embodiments exchanges forwarding state with the hardware. As an example, a hardware switch can publish its forwarding state by having its database client write a learned MAC address to a database table. This will in turn cause the database server to push the update to a database client executing on a network controller. The network controller can then notify other forwarding elements (e.g., software switches, hardware switches) regarding the update.

In some embodiments, uploading the MAC addresses is a continuous process. The hardware switch will keep on learning MAC addresses. For example, the hardware switch might have various timeout mechanisms. If a machine is moved away or is silent for a long period of time, the hardware switch might remove the machine's MAC address. The hardware switch might add the address again if it detects that the machine is connected to one of its physical ports. In some embodiments, each MAC address is published by the forwarding element (e.g., hardware switch, hypervisor) with the forwarding element's IP address.

On the other side, the network controller knows the MAC addresses of the virtual machines connected to software switches. To stress, the MAC addresses of the virtual machines are known addresses and not learned MAC addresses. These known MAC addresses are pushed by the network controller writing to the hardware switch's database. One reason they are known rather than learned is that VMs are connected to virtual interfaces (VIFs) and there is no timeout mechanism associated with the VIFs. The VM is either associated with one VIF or not associated with that VIF. In some embodiments, the forwarding state of a software switching element is asynchronously sent from the network controller to the hardware switching element when the software switching element's forwarding state has changed (e.g., a VM is attached to a VIF).

FIG. 14 provides a data flow diagram that shows a top of rack (TOR) switch 1025 that publishes an address (e.g., a MAC address) of a machine that is connected to one of its ports. As shown, the data flow begins when the TOR switch 1025 learns a MAC address of the machine. Upon learning the MAC address, the TOR switch 1025 might generate a flow based on the MAC address and write the flow in its memory (e.g., CAM, TCAM). In addition, a program on the switch's software stack might send a message to the database client on the switch regarding the learned MAC address. The database client then sends a transaction message to the database server with the MAC address. In some embodiments, the transaction message includes the physical port number and a VLAN identifier.

When the database is updated with the learned MAC address, the database server of some embodiments sends an update notification to a network controller 1405 (e.g., in a network controller cluster). In some embodiments, the network controller has a database client that has been configured to monitor updates to a table or a subset of a table using the database protocol. For example, the client might have previously sent a monitor or registration request with a set of parameters (e.g., the columns being monitored, what kind of transactions are being monitored, such as initial, insert, delete, modify, etc.). The database server of some embodiments sends an update notification that has a copy of the table or a subset of the table (e.g., a record) that was updated. Several examples of a client monitoring a database are described above by reference to FIGS. 6 and 7.
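The monitor-and-notify interaction can be sketched with a toy in-memory database server. The following Python sketch is an illustration under stated assumptions: the class TinyDbServer, its methods, and the table and column names are hypothetical and do not reproduce the message format of any actual database protocol.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Row = Dict[str, object]
Callback = Callable[[str, str, Row], None]


class TinyDbServer:
    """Hypothetical database server that notifies monitoring clients."""

    def __init__(self) -> None:
        self._tables: Dict[str, List[Row]] = {}
        self._monitors: List[Tuple[str, Set[str], Set[str], Callback]] = []

    def monitor(self, table: str, columns: Iterable[str],
                kinds: Iterable[str], callback: Callback) -> None:
        # Registration request: which table, which columns, and which kinds
        # of transactions (e.g., "insert", "delete") the client cares about.
        self._monitors.append((table, set(columns), set(kinds), callback))

    def insert(self, table: str, row: Row) -> None:
        # Transaction from a client (e.g., the switch's local client).
        self._tables.setdefault(table, []).append(dict(row))
        self._notify(table, "insert", row)

    def _notify(self, table: str, kind: str, row: Row) -> None:
        for mon_table, columns, kinds, callback in self._monitors:
            if mon_table == table and kind in kinds:
                # Send only the monitored columns of the updated record.
                update = {c: row[c] for c in columns if c in row}
                callback(table, kind, update)


received: List[Tuple[str, str, Row]] = []
server = TinyDbServer()
# The controller's database client registers for inserts of learned MACs.
server.monitor("learned_macs", ["mac", "vlan"], ["insert"],
               lambda t, k, r: received.append((t, k, r)))
# The switch's local client writes a learned MAC address.
server.insert("learned_macs",
              {"mac": "00:00:00:00:00:01", "port": 1, "vlan": 1})
```

In this sketch the controller's client receives the mac and vlan columns but not the port column, because the port column was not part of its registration request.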

As shown in FIG. 14, the TOR switch 1025 publishes the MAC address to the network controller 1405. In some embodiments, the MAC address is sent with a virtual network identifier (VNI) of the logical switch and an IP address of the TOR switch 1025. The network controller receives the information and then pushes it down to the hypervisors 1015 and 1020. Each of the hypervisors 1015 and 1020 then updates its respective flow table with a flow that includes the MAC address and an action to forward a packet with the same MAC address (e.g., in the header's destination MAC address field) to the TOR switch 1025. In some embodiments, the network controller 1405 receives the MAC address, computes a flow, and sends the flow to each of the hypervisors 1015 and 1020. The flow may be sent to each hypervisor using the OpenFlow protocol. That is, instead of the database protocol (e.g., OVSDB) channel, the flow will be sent to the hypervisors using one or more OpenFlow calls. However, if a tunnel did not exist between the TOR switch 1025 and the hypervisor (1015 or 1020), then the network controller would have to send instructions (e.g., the address of the TOR switch, VNIs, etc.) to the hypervisor to create the tunnel. The instructions are sent to the hypervisor using the database protocol (e.g., OVSDB protocol), in some embodiments.

In the example described above, the TOR switch 1025 publishes a MAC address to the network controller 1405. FIG. 15 provides a data flow diagram that shows the hypervisor 1020 publishing a MAC address of a virtual machine (VM) that is connected to its virtual interface (VIF). This flow begins when a new VIF is created and the VM is associated with the VIF.

As shown in FIG. 15, the hypervisor 1020 is notified of the MAC address of the VM attached to VIF4. In some embodiments, when a VM is attached to a VIF, the software switch executing on the hypervisor is notified of the MAC address of the VM. The software switch then sends the MAC address to the network controller 1405. In some embodiments, the MAC address is sent with one or more of the following: the VNI to identify the logical switch, a logical port number, and an IP address of the hypervisor 1020.

Having received the information, the network controller 1405 sends the information to the hypervisor 1015 and the TOR switch 1025. The hypervisor 1015 receives the information. The hypervisor 1015 then updates its flow table with a flow that includes the MAC address and an action to forward a packet with the same MAC address (e.g., in the header's destination MAC address field) to the hypervisor 1020. In some embodiments, the network controller 1405 receives the information, computes a flow, and sends the flow to the hypervisor 1015. The flow may be sent to each hypervisor using the OpenFlow protocol. That is, instead of the database protocol (e.g., OVSDB) channel, the flow will be sent to the hypervisors using one or more OpenFlow calls.

In contrast to the hypervisor 1015, the network controller 1405 sends the information to the TOR switch 1025 by first creating a new transaction message. The transaction is then sent to the database server on the TOR switch 1025. The database server then updates the database based on the instructions in the message. In some embodiments, the database client on the TOR switch 1025 is notified of the update. The update is then passed to a program on the switch's software stack. The program uses the update to generate a flow that includes the MAC address and an action to forward a packet with the same MAC address (e.g., in the header's destination MAC address field) to the hypervisor 1020.

V. Packet Flow Examples

The previous section described several examples of the network controller exchanging forwarding state with the dedicated hardware. Several examples of forwarding packets will now be described below by reference to FIGS. 16-19.

In these examples, a set of one or more network controllers has computed a set of flows for a software switching element to forward packets to other switching elements (e.g., software switching elements, hardware switching elements). The set of network controllers has also exchanged forwarding state with one or more hardware switching elements. Each hardware switching element has translated the state to a set of flows that it can understand to forward packets.

FIG. 16 illustrates a packet flow diagram for a known unicast. Specifically, this figure shows the TOR switch 1025 receiving a packet and forwarding it to the hypervisor 1020. In this example, the TOR switch 1025 has programmed a flow in its ASIC based on the information (e.g., the MAC address, VNI, etc.) received from the network controller (not shown). A machine is connected to a port. The port is associated with a VLAN identifier.

The flow begins when a packet is received through a physical port of the TOR switch 1025. The physical port is associated with a port and VLAN pairing (i.e., port 2, VLAN 2). As shown by the box above the figure number, the TOR switch 1025 performs a look-up operation. For instance, when the packet is received on port 2 and VLAN 2, the destination MAC address is already present in the packet. The TOR switch looks at the port on which the packet came in and at the VLAN ID, and based on this information, it determines that the packet should be tunneled to the hypervisor 1020 using the VNI 2. In the example of FIG. 16, the hypervisor 1020 is identified by its associated IP address.

After performing the look-up operation, the TOR switch 1025 encapsulates the packet using the tunnel protocol and sends the packet over the tunnel to the hypervisor 1020. Specifically, the packet is encapsulated using the VNI (e.g., the VNI 2). Here, the TOR switch 1025 sends the packet to the hypervisor 1020 using the IP address of the hypervisor.

When the hypervisor 1020 receives the packet, it performs a flow lookup operation to output the packet to an appropriate port. Specifically, it parses the packet to identify header values, such as the destination MAC address. The hypervisor 1020 then tries to find a matching flow in its datapath or cache. If the matching flow is found, the hypervisor 1020 uses the flow to output the packet to a VM. If no matching flow is found, the hypervisor generates a flow to push into the datapath to process the packet. In this example, the packet is output to VIF4, which is associated with logical port 4 (e.g., in FIG. 12).
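The lookup with a fallback on a cache miss can be sketched as follows. This Python sketch rests on stated assumptions: the class SoftwareSwitch, its attributes, and the MAC address are hypothetical, and a real datapath matches on many more header fields than the destination MAC address alone.

```python
from typing import Dict, Optional


class SoftwareSwitch:
    """Hypothetical software switch with a fast-path flow cache."""

    def __init__(self, mac_to_port: Dict[str, str]) -> None:
        self._forwarding_state = dict(mac_to_port)  # consulted on a miss
        self._datapath_cache: Dict[str, str] = {}   # flows already installed
        self.misses = 0

    def output_port(self, dst_mac: str) -> Optional[str]:
        # Try to find a matching flow in the datapath cache first.
        port = self._datapath_cache.get(dst_mac)
        if port is None:
            # No matching flow: generate one and push it into the datapath
            # so that later packets with the same header hit the cache.
            self.misses += 1
            port = self._forwarding_state.get(dst_mac)
            if port is not None:
                self._datapath_cache[dst_mac] = port
        return port


switch = SoftwareSwitch({"00:00:00:00:00:04": "VIF4"})
first = switch.output_port("00:00:00:00:00:04")   # miss, flow is installed
second = switch.output_port("00:00:00:00:00:04")  # hit in the cache
```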

FIG. 17 illustrates a packet flow diagram for a multicast. In some embodiments, the network virtualization system utilizes a service node to perform a multicast, such as a broadcast. For example, when a hypervisor receives a packet with instructions to broadcast it, the hypervisor sends the packet to the service node. The service node then replicates the packet to one or more other hardware switching elements. One of the reasons for utilizing such a service node is that a software switching element has known addresses (e.g., MAC addresses) of machines, whereas a hardware switching element may have one or more unknown addresses of machines.

FIG. 17 shows an example of a hypervisor 1720 receiving a packet from a machine and sending the packet to a service node 1705 in order to broadcast the packet to other machines in the virtual network that are in the same broadcast domain. The flow begins when a packet is received through a virtual interface (VIF) of the hypervisor 1720.

As shown by the box above the figure number, the hypervisor 1720 performs a look-up operation. For instance, when the packet is received through VIF1, the hypervisor looks at this information and other supplemental information (e.g., the logical port number associated with the VIF), and, based on one or more pieces of that information, it determines that the packet should be tunneled to the service node 1705 using the VNI 2. In the example of FIG. 17, the service node 1705 is identified by its associated IP address.

When the service node 1705 receives the packet, it replicates the packet and sends the packet to each of the TOR switches 1710 and 1715. In some embodiments, the service node 1705 sends the packet over a tunnel that is established between the service node and each of the two TOR switches 1710 and 1715. The packet is sent over the corresponding tunnel using the VNI. As shown, the service node 1705 forwards the packets to the tunnel endpoints, namely the TOR switches 1710 and 1715. The TOR switch 1710 sends the packet out of Port 1 and VLAN 1, and the TOR switch 1715 sends the packet out of Port 4 and VLAN 1.

FIG. 18 shows an example of a TOR switch 1805 receiving an Address Resolution Protocol (ARP) packet from a machine and sending the packet to a service node 1705. ARP is used to translate IP addresses to MAC addresses. In this example, the TOR switch 1805 has programmed a flow in its ASIC that specifies sending such a broadcast packet to the service node 1705. In some embodiments, the TOR switch 1805 has generated the flow with the IP address of the service node from the switch's database.

The flow begins when a packet is received through a physical port of the TOR switch 1805. The physical port is associated with a port and VLAN pairing (i.e., port 1, VLAN 1). As shown by the box, the TOR switch looks at the port on which the packet came in, the VLAN ID, and whether the packet is an ARP packet; based on this information, it determines that the packet should be tunneled to the service node 1705 using the VNI 1.

When the service node 1705 receives the packet, the service node 1705 sends the packet to each of the hypervisors 1810 and 1815. In some embodiments, the service node 1705 sends the packet over a tunnel that is established between the service node and each of the two hypervisors 1810 and 1815. The packet is sent over the corresponding tunnel using the VNI. As shown, the service node 1705 forwards the packets to the tunnel endpoints, namely the hypervisors 1810 and 1815. The hypervisor 1810 sends the packet out of logical port 1 of logical switch 1, which is associated with VIF1. The hypervisor 1815 sends the packet out of logical port 2 of logical switch 1, which is associated with VIF2.

FIG. 19 illustrates a packet flow diagram for an unknown unicast. An unknown unicast occurs when the destination MAC is a unicast MAC but there is no knowledge of the MAC address. For example, the network controller cluster may not recognize the MAC address. In some embodiments, the network virtualization system assumes that the VIFs have known MACs. Therefore, the hypervisors 1905 and 1915 may never receive a packet destined for an unknown unicast address. For example, if a hypervisor 1905 sends an unknown unicast packet to a service node 1705, the service node 1705 would forward the packet to each of the TOR switches but not to the other hypervisors. This is because the system assumes that the TOR switch can have unknown unicast addresses behind it.

In the example of FIG. 19, the hypervisor 1905 receives a packet from a VM and sends it to a service node 1705. In this example, the hypervisor 1905 has a flow entry to send a packet with an unknown MAC address in the packet's header to the service node 1705; the MAC address is likewise unknown to the service node 1705. As shown, the flow begins when a packet is received through the logical port 1. The logical port 1 is associated with VIF1. The hypervisor 1905 then encapsulates the packet using the tunnel protocol and sends the packet over a tunnel to the service node 1705. Specifically, the packet is encapsulated using the VNI (i.e., the VNI 1 for the logical switch 1). In some embodiments, the hypervisor 1905 sends the packet to the service node 1705 using the IP address of the service node 1705.

When the service node 1705 receives the packet, the service node 1705 sends the packet over a tunnel that is established between the service node and the TOR switch 1910. However, the service node 1705 does not send the packet to the hypervisor 1915. As mentioned above, this is because the system assumes that the TOR switch 1910 can have unknown unicast addresses behind it, while the hypervisor 1915 only has known unicast addresses.
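The service node's replication decision in FIGS. 18 and 19 can be sketched as a small filter. This is an illustrative sketch under stated assumptions, not the service node's implementation: the names Kind, Endpoint, and replication_targets are hypothetical, and skipping the tunnel endpoint that the packet arrived from is an assumption made here rather than something the text states.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Kind(Enum):
    HYPERVISOR = "hypervisor"
    TOR = "tor"


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical tunnel endpoint known to the service node."""
    name: str
    kind: Kind


def replication_targets(endpoints: List[Endpoint],
                        source: Endpoint,
                        unknown_unicast: bool) -> List[Endpoint]:
    """Pick the tunnel endpoints that should receive a copy of the packet.

    Broadcast packets (e.g., ARP) go to every other endpoint. Unknown
    unicast packets go only to TOR switches, because hypervisors are
    assumed to have only known MAC addresses behind them.
    """
    targets = []
    for endpoint in endpoints:
        if endpoint == source:
            continue  # assumption: never send the packet back to its source
        if unknown_unicast and endpoint.kind is Kind.HYPERVISOR:
            continue  # hypervisors only have known unicast addresses
        targets.append(endpoint)
    return targets
```

With endpoints standing in for the hypervisor 1905, the TOR switch 1910, and the hypervisor 1915, an unknown unicast packet from the hypervisor 1905 is replicated only to the TOR switch 1910, whereas a broadcast packet would also reach the hypervisor 1915.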

The packet is sent over the corresponding tunnel using the VNI. As shown, the service node 1705 forwards the packet to the tunnel endpoint, namely the TOR switch 1910. The TOR switch 1910 sends the packet out one of its physical ports that is associated with the port and VLAN pairing.

VI. Example Environment

The following section will describe the environment in which some embodiments of the invention are implemented. In the present application, switching elements and machines may be referred to as network elements. In addition, a network that is managed by one or more network controllers may be referred to as a managed network in the present application. In some embodiments, the managed network includes only managed switching elements (e.g., switching elements that are controlled by one or more network controllers) while, in other embodiments, the managed network includes managed switching elements as well as unmanaged switching elements (e.g., switching elements that are not controlled by a network controller).

In some embodiments, a network controller cluster controls one or more managed switching elements that are located at the edge of a network (e.g., tunnel endpoints, edge switching elements, or edge devices). In addition to controlling edge switching elements, the network controller cluster of some embodiments also utilizes and controls non-edge switching elements that are inserted in the network to simplify and/or facilitate the operation of the managed edge switching elements. For instance, in some embodiments, the network controller cluster requires the switching elements that the network controller cluster controls to be interconnected in a hierarchical switching architecture that has several edge switching elements as the leaf nodes in the hierarchical switching architecture and one or more non-edge switching elements as the non-leaf nodes in this architecture. In some such embodiments, each edge switching element connects to one or more of the non-leaf switching elements, and uses such non-leaf switching elements to facilitate the communication of the edge switching element with other edge switching elements.

Some embodiments employ one level of non-leaf (non-edge) switching elements that connect to edge switching elements and in some cases to other non-leaf switching elements. Other embodiments, on the other hand, employ multiple levels of non-leaf switching elements, with each level of non-leaf switching elements after the first level serving as a mechanism to facilitate communication between lower level non-leaf switching elements and leaf switching elements. In some embodiments, the non-leaf switching elements are software switching elements that are implemented by storing the switching tables in the memory of a standalone computer instead of an off-the-shelf switch. As described herein, the leaf or non-leaf switching elements can be extended beyond software switching elements to hardware switching elements.

As mentioned above, some embodiments provide a network controller cluster that communicates with a wide range of devices (e.g., third-party hardware), ranging from switches to appliances such as firewalls, load balancers, etc. The network controller cluster communicates with such devices to connect them to its managed virtual networks. The network controller cluster can define each virtual network through software switches and/or software appliances. The common feature of the different types of equipment or boxes is that each of them has a database server that is controlled through a protocol.

The network controller of some embodiments can be implemented in software as an instance of an application. As illustrated in FIG. 20, the network controllers 2010 and 2020 are instances of a software application. As shown, each of the network controllers 2010 and 2020 includes several software layers: a control application layer, a virtualization application layer, and a network operating system layer.

In some embodiments, the control application layer receives user input that specifies a network switching element. The control application layer may receive the user input through any number of different interfaces, such as a graphical user interface (GUI), a command line interface, a web-based interface, a touchscreen interface, etc. In some embodiments, the user input specifies characteristics and behaviors of the network switching element, such as the number of switching element ports, access control lists (ACLs), network data forwarding, port security, or any other network switching element configuration options.

The control application layer of some embodiments defines a logical data path set based on user input that specifies a network switching element. As noted above, a logical data path set is a set of network data paths through managed switching elements that are used to implement the user-specified network switching element. In other words, the logical data path set is a logical representation of the network switching element and the network switch's specified characteristics and behaviors.

Some embodiments of the virtualization application layer translate the defined logical data path set into network configuration information for implementing the logical network switching element across the managed switching elements in the network. For example, the virtualization application layer of some embodiments translates the defined logical data path set into a corresponding set of data flows. In some of these cases, the virtualization application layer may take into account various factors (e.g., logical switching elements that are currently implemented across the managed switching elements, the current network topology of the network, etc.) in determining the corresponding set of data flows.

The network operating system layer of some embodiments configures the managed switching elements' routing of network data. In some embodiments, the network operating system instructs the managed switching elements to route network data according to the set of data flows determined by the virtualization application layer.
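The division of work among the three layers described above can be pictured as a simple pipeline. The following is only a conceptual sketch under assumed, simplified data shapes: the names ManagedSwitch, control_layer, virtualization_layer, and network_os_layer are hypothetical, a logical data path set is reduced to a dictionary, and a data flow is reduced to a (logical switch, logical port) pair.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Flow = Tuple[str, str]  # simplified flow: (logical switch, logical port)


@dataclass
class ManagedSwitch:
    """Hypothetical managed switching element holding installed flows."""
    name: str
    flows: List[Flow] = field(default_factory=list)


def control_layer(switch_name: str, ports: List[str]) -> Dict:
    """Turn user input into a (simplified) logical data path set."""
    return {"logical_switch": switch_name, "ports": list(ports)}


def virtualization_layer(ldps: Dict,
                         placement: Dict[str, str]) -> Dict[str, List[Flow]]:
    """Translate the logical data path set into flows per managed switch.

    placement maps each logical port to the managed switching element
    that hosts it; it stands in for the network topology.
    """
    flows: Dict[str, List[Flow]] = {}
    for port in ldps["ports"]:
        host = placement[port]
        flows.setdefault(host, []).append((ldps["logical_switch"], port))
    return flows


def network_os_layer(flows: Dict[str, List[Flow]],
                     switches: Dict[str, ManagedSwitch]) -> None:
    """Instruct each managed switching element to install its flows."""
    for name, entries in flows.items():
        switches[name].flows.extend(entries)
```

In this sketch, a logical switch with two logical ports placed on two different managed switching elements results in one flow being installed on each element.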

In some embodiments, the network operating system layer maintains several views of the network based on the current network topology. One view that the network operating system layer maintains is a logical view. The logical view of the network includes the different logical switching elements that are implemented across the managed switching elements, in some embodiments. Some embodiments of the network operating system layer maintain a managed view of the network. Such managed views include the different managed switching elements in the network (i.e., the switching elements in the network that the network controllers control). In some embodiments, the network operating system layer also maintains relationship data that relate the logical switching elements implemented across the managed switching elements to the managed switching elements.

While FIG. 20 (and other figures in this application) may show a set of managed switching elements managed by a network controller, some embodiments provide several network controllers (also referred to as a cluster of network controllers or a control cluster) for managing the set of managed switching elements (e.g., virtual switches, hardware switches). In other embodiments, different control clusters may manage different sets of managed switching elements. Employing a cluster of network controllers in such embodiments to manage a set of managed switches increases the scalability of the managed network and increases the redundancy and reliability of the managed network. In some embodiments, the network controllers in a control cluster share (e.g., through the network operating system layer of the network controllers) data related to the state of the managed network in order to synchronize the network controllers.

As mentioned above, the network control system of some embodiments can manage both physical switching elements and software switching elements. FIG. 20 illustrates an example of such a network control system. In particular, this figure conceptually illustrates a network control system 2000 of some embodiments for managing TOR switching element 2030 and OVSs running on hosts in the racks of hosts 2070 and 2080.

In this example, the managed switching element 2030 and the OVSs running on the hosts in the racks of hosts 2070 and 2080 are edge-switching elements because they are the last switching elements before end machines in the network. In particular, the network controller 2010 manages the TOR switching element 2030 and the OVSs that are running on the hosts in the rack of hosts 2070, and the network controller 2020 manages the OVSs that are running on the hosts in the rack of hosts 2080.

The above figures illustrate examples of network controllers that control edge-switching elements in a network. However, in some embodiments, the network controllers can control non-edge-switching elements as well. FIG. 21 illustrates a network control system that includes such network controllers. In particular, FIG. 21 conceptually illustrates a network control system 2100 of some embodiments for managing TOR switching elements 2130-21210 and OVSs running on hosts in the racks of hosts 2170 and 2180.

As shown in FIG. 21, the network controllers 2110 and 2120 manage edge switching elements and non-edge switching elements. Specifically, the network controller 2110 manages the TOR switching elements 2130 and 2120, and the OVSs running on the hosts in the rack of hosts 2170. The network controller 2120 manages TOR switching element 2180 and the OVSs running on the hosts in the rack of hosts 2180. In this example, the TOR switching element 2130 and the OVSs running on the hosts in the racks of hosts 2170 and 2180 are edge-switching elements, and the TOR switching elements 2140 and 2105 are non-edge switching elements.

VII. Electronic System

Many of the above-described features and applications are implemented as software processes that are specified as a set of instructions recorded on a computer readable storage medium (also referred to as computer readable medium). When these instructions are executed by one or more computational or processing unit(s) (e.g., one or more processors, cores of processors, or other processing units), they cause the processing unit(s) to perform the actions indicated in the instructions. Examples of computer readable media include, but are not limited to, CD-ROMs, flash drives, random access memory (RAM) chips, hard drives, erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), etc. The computer readable media do not include carrier waves and electronic signals passing wirelessly or over wired connections.

In this specification, the term “software” is meant to include firmware residing in read-only memory or applications stored in magnetic storage, which can be read into memory for processing by a processor. Also, in some embodiments, multiple software inventions can be implemented as sub-parts of a larger program while remaining distinct software inventions. In some embodiments, multiple software inventions can also be implemented as separate programs. Finally, any combination of separate programs that together implement a software invention described here is within the scope of the invention. In some embodiments, the software programs, when installed to operate on one or more electronic systems, define one or more specific machine implementations that execute and perform the operations of the software programs.

FIG. 22 conceptually illustrates an electronic system 2200 with which some embodiments of the invention are implemented. The electronic system 2200 may be a computer (e.g., a desktop computer, personal computer, tablet computer, etc.), server, dedicated switch, phone, PDA, or any other sort of electronic or computing device. Such an electronic system includes various types of computer readable media and interfaces for various other types of computer readable media. Electronic system 2200 includes a bus 2205, processing unit(s) 2210, a system memory 2225, a read-only memory 2230, a permanent storage device 2235, input devices 2240, and output devices 2245.

The bus 2205 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of the electronic system 2200. For instance, the bus 2205 communicatively connects the processing unit(s) 2210 with the read-only memory 2230, the system memory 2225, and the permanent storage device 2235.

From these various memory units, the processing unit(s) 2210 retrieves instructions to execute and data to process in order to execute the processes of the invention. The processing unit(s) may be a single processor or a multi-core processor in different embodiments.

The read-only-memory (ROM) 2230 stores static data and instructions that are needed by the processing unit(s) 2210 and other modules of the electronic system. The permanent storage device 2235, on the other hand, is a read-and-write memory device. This device is a non-volatile memory unit that stores instructions and data even when the electronic system 2200 is off. Some embodiments of the invention use a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive) as the permanent storage device 2235.

Other embodiments use a removable storage device (such as a floppy disk, flash memory device, etc., and its corresponding drive) as the permanent storage device. Like the permanent storage device 2235, the system memory 2225 is a read-and-write memory device. However, unlike storage device 2235, the system memory 2225 is a volatile read-and-write memory, such as a random access memory. The system memory 2225 stores some of the instructions and data that the processor needs at runtime. In some embodiments, the invention's processes are stored in the system memory 2225, the permanent storage device 2235, and/or the read-only memory 2230. From these various memory units, the processing unit(s) 2210 retrieves instructions to execute and data to process in order to execute the processes of some embodiments.

The bus 2205 also connects to the input and output devices 2240 and 2245. The input devices 2240 enable the user to communicate information and select commands to the electronic system. The input devices 2240 include alphanumeric keyboards and pointing devices (also called “cursor control devices”), cameras (e.g., webcams), microphones or similar devices for receiving voice commands, etc. The output devices 2245 display images generated by the electronic system or otherwise output data. The output devices 2245 include printers and display devices, such as cathode ray tubes (CRT) or liquid crystal displays (LCD), as well as speakers or similar audio output devices. Some embodiments include devices such as a touchscreen that function as both input and output devices.

Finally, as shown in FIG. 22, bus 2205 also couples electronic system 2200 to a network 2265 through a network adapter (not shown). In this manner, the computer can be a part of a network of computers (such as a local area network (“LAN”), a wide area network (“WAN”), or an Intranet), or a network of networks, such as the Internet. Any or all components of electronic system 2200 may be used in conjunction with the invention.

Some embodiments include electronic components, such as microprocessors, storage and memory that store computer program instructions in a machine-readable or computer-readable medium (alternatively referred to as computer-readable storage media, machine-readable media, or machine-readable storage media). Some examples of such computer-readable media include RAM, ROM, read-only compact discs (CD-ROM), recordable compact discs (CD-R), rewritable compact discs (CD-RW), read-only digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), a variety of recordable/rewritable DVDs (e.g., DVD-RAM, DVD-RW, DVD+RW, etc.), flash memory (e.g., SD cards, mini-SD cards, micro-SD cards, etc.), magnetic and/or solid state hard drives, read-only and recordable Blu-Ray® discs, ultra density optical discs, any other optical or magnetic media, and floppy disks. The computer-readable media may store a computer program that is executable by at least one processing unit and includes sets of instructions for performing various operations. Examples of computer programs or computer code include machine code, such as is produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter.

While the above discussion primarily refers to microprocessors or multi-core processors that execute software, some embodiments are performed by one or more integrated circuits, such as application specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs). In some embodiments, such integrated circuits execute instructions that are stored on the circuit itself. In addition, some embodiments execute software stored in programmable logic devices (PLDs), ROM, or RAM devices.

As used in this specification and any claims of this application, the terms “computer”, “server”, “processor”, and “memory” all refer to electronic or other technological devices. These terms exclude people or groups of people. For the purposes of the specification, the terms display or displaying mean displaying on an electronic device. As used in this specification and any claims of this application, the terms “computer readable medium,” “computer readable media,” and “machine readable medium” are entirely restricted to tangible, physical objects that store information in a form that is readable by a computer. These terms exclude any wireless signals, wired download signals, and any other ephemeral signals.

While the invention has been described with reference to numerous specific details, one of ordinary skill in the art will recognize that the invention can be embodied in other specific forms without departing from the spirit of the invention. In addition, a number of the figures (including FIGS. 8 and 13) conceptually illustrate processes. The specific operations of these processes may not be performed in the exact order shown and described. The specific operations may not be performed in one continuous series of operations, and different specific operations may be performed in different embodiments. Furthermore, the process could be implemented using several sub-processes, or as part of a larger macro process. Thus, one of ordinary skill in the art would understand that the invention is not to be limited by the foregoing illustrative details, but rather is to be defined by the appended claims.

What is claimed is:
 1. A method of implementing a logical switching element from software and hardware switching elements, the method comprising: from a network controller, sending a first transaction that instructs a database server on the hardware switching element to write to a database a logical forwarding element identifier (LFEI) that identifies a logical switch; from the network controller, sending a second transaction that instructs the database server on the hardware switching element to write to the database an address of at least one software switching element that uses the LFEI, wherein the hardware switching element uses the address to establish a tunnel between the hardware switching element and the software switching element, and wherein the hardware and software switching elements implement the logical switching element by sending packets over the established tunnel using the LFEI.
 2. The method of claim 1 further comprising: prior to sending the first transaction, sending, from the network controller, a lock request instructing the database server to lock a set of columns of the database, wherein the lock request prevents another network controller from making any changes to the set of columns; and receiving a notification from the database server indicating that a lock has been granted to the network controller.
 3. The method of claim 2 further comprising: upon receiving confirmation of the completion of the first transaction or the first and second transactions, sending an unlock request instructing the database server to release the lock on the set of columns so that the other network controller is able to make changes to the same set of columns.
 4. The method of claim 1, wherein each of the first and second transactions comprises a set of operations to perform on the database.
 5. The method of claim 4, wherein each operation is executed by the database server in a specified order, except that if an operation fails, then each remaining operation is not executed.
 6. The method of claim 5, wherein each of the first and second transactions is committed or applied to the database only if each and every operation has been successfully executed.
 7. The method of claim 1, wherein the database server on the hardware switching element performs a garbage collection operation periodically or when triggered to remove table entries that are no longer used in the database.
 8. The method of claim 7, wherein the database server identifies that a table entry is no longer used if the table entry is not part of a root set.
 9. The method of claim 1, wherein the address of the software switching element is an internet protocol (IP) address.
 10. The method of claim 1, wherein the software switching element is a first software switching element and the address is a first address, the method further comprising: from the network controller, sending a third transaction that instructs the database server on the hardware switching element to write to the database a second address of a second software switching element that uses the same LFEI, wherein the hardware switching element uses the second address to establish a tunnel between the hardware switching element and the second software switching element, and wherein the first software switching element, the second software switching element, and the hardware switching element implement the logical switching element by sending packets over the established tunnel using the LFEI.
 11. A non-transitory machine readable medium storing a network controller that when executed by at least one processing unit facilitates implementing a logical switching element from software and hardware switching elements, the network controller comprising sets of instructions for: from the network controller, sending a first transaction that instructs a database server on the hardware switching element to write to a database a logical forwarding element identifier (LFEI) that identifies a logical switch; from the network controller, sending a second transaction that instructs the database server on the hardware switching element to write to the database an address of at least one software switching element that uses the LFEI, wherein the hardware switching element uses the address to establish a tunnel between the hardware switching element and the software switching element, and wherein the hardware and software switching elements implement the logical switching element by sending packets over the established tunnel using the LFEI.
 12. The non-transitory machine readable medium of claim 11, wherein the program further comprises sets of instructions for: prior to sending the first transaction, sending, from the network controller, a lock request instructing the database server to lock a set of columns of the database, wherein the lock request prevents another network controller from making any changes to the set of columns; and receiving a notification from the database server indicating that a lock has been granted to the network controller.
 13. The non-transitory machine readable medium of claim 12, wherein the program further comprises a set of instructions for: upon receiving confirmation of the completion of the first transaction or the first and second transactions, sending an unlock request instructing the database server to release the lock on the set of columns so that the other network controller is able to make changes to the same set of columns.
 14. The non-transitory machine readable medium of claim 11, wherein each of the first and second transactions comprises a set of operations to perform on the database.
 15. The non-transitory machine readable medium of claim 14, wherein each operation is executed by the database server in a specified order, except that if an operation fails, then each remaining operation is not executed.
 16. The non-transitory machine readable medium of claim 15, wherein each of the first and second transactions is committed or applied to the database only if each and every operation has been successfully executed.
 17. The non-transitory machine readable medium of claim 11, wherein the database server on the hardware switching element performs a garbage collection operation periodically or when triggered to remove table entries that are no longer used in the database.
 18. The non-transitory machine readable medium of claim 17, wherein the database server identifies that a table entry is no longer used if the table entry is not part of a root set.
 19. The non-transitory machine readable medium of claim 11, wherein the address of the software switching element is an internet protocol (IP) address.
 20. The non-transitory machine readable medium of claim 11, wherein the software switching element is a first software switching element and the address is a first address, wherein the program further comprises a set of instructions for: from the network controller, sending a third transaction that instructs the database server on the hardware switching element to write to the database a second address of a second software switching element that uses the same LFEI, wherein the hardware switching element uses the second address to establish a tunnel between the hardware switching element and the second software switching element, and wherein the first software switching element, the second software switching element, and the hardware switching element implement the logical switching element by sending packets over the established tunnel using the LFEI.